Re: strange panic, debugging help wanted..

16 Feb 2005


Andrew Gallatin writes:
...
Godfrey van der Linden writes:
...
I'd like to see the assembly that this kernel is running at 0x2CAE8.
I'd be willing to bet an offset from a NULL pointer is being taken and
that is why you are panicking.
I think the 'r1' panic is a red herring. The first exception state,
'PC=0x0002CAE8; MSR=0x00001030; DAR=0x000000D4; DSISR=0x40000000;
LR=0x0002CAD8; R1=0x0CC33DB0; XCP=0x0000000C (0x300 - Data access)',
indicates that r1 is valid at the time the panic is taken.  Do
you have a symbolled kernel for the version that is taking the panic?
If you can find out what routine was passed a NULL pointer you may have
a suspect.

I've just found this crash waiting for me again this morning, and I
managed to connect to it via remote gdb.  This is with 10.3.7, though
the PC is the same as it was for the much earlier version of Mac OS X.
The code seems to be in a totally different spot, which means maybe
I totally botched converting the addresses to symbols last time.

(gdb) paniclog

Unresolved kernel trap(cpu 1): 0x300 - Data access DAR=0x00000000000000D4 PC=0x000000000002CAE8
Latest crash info for cpu 1:
   Exception state (sv=0x22D43000)
      PC=0x0002CAE8; MSR=0x00001030; DAR=0x000000D4; DSISR=0x40000000; LR=0x0002CAD8; R1=0x0CC5BDB0; XCP=0x0000000C (0x300 - Data access)
      Backtrace:
         0x0002CAD8 0x0002C8A8 0x0002C870
Proceeding back via exception chain:
   Exception state (sv=0x22D43000)
      previously dumped as "Latest" state. skipping...
   Exception state (sv=0x00A01500)
      PC=0x00000000; MSR=0x0000D030; DAR=0x00000000; DSISR=0x00000000; LR=0x00000000; R1=0x00000000; XCP=0x00000000 (Unknown)

Kernel version:
Darwin Kernel Version 7.7.0:
Sun Nov  7 16:06:51 PST 2004; root:xnu/xnu-517.9.5.obj~1/RELEASE_PPC

[ CPU 0 then panics because a simple lock acquisition times out]
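
As a side note on reading the exception state above: a minimal sketch of
decoding DSISR=0x40000000, assuming the classic 32-bit PowerPC DSISR layout
(bit 1 = translation not found, bit 4 = protection violation, bit 6 = store).
The macro and function names are made up for illustration and are not taken
from xnu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed classic PowerPC DSISR flag bits (IBM numbering, bit 0 = MSB).
 * These names are hypothetical, not the kernel's own. */
#define DSISR_NOT_FOUND  0x40000000u  /* bit 1: no translation for the address */
#define DSISR_PROTECTION 0x08000000u  /* bit 4: access not permitted */
#define DSISR_STORE      0x02000000u  /* bit 6: set for a store, clear for a load */

static_assert((DSISR_NOT_FOUND & DSISR_STORE) == 0, "flags must be distinct bits");

/* True when the fault was a load from an address with no valid mapping. */
static bool dsisr_is_unmapped_load(uint32_t dsisr)
{
    return (dsisr & DSISR_NOT_FOUND) != 0 && (dsisr & DSISR_STORE) == 0;
}
```

Under that assumption, dsisr_is_unmapped_load(0x40000000u) is true, which
fits a read through a pointer near address zero (DAR=0xD4).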

The symbolic backtrace is:
(gdb) bt
#0  0x0002cae8 in do_thread_scan () at /SourceCache/xnu/xnu-517.9.5/osfmk/kern/sched_prim.c:2790
#1  0x0002c8a8 in sched_tick_thread_continue () at /SourceCache/xnu/xnu-517.9.5/osfmk/kern/sched_prim.c:2671

Frame 0 works out to be:

                                thread = processor->idle_thread;
                                if (thread->sched_stamp != sched_tick) {
                                        if (stuck_count == MAX_STUCK_THREADS) {
                                                restart_needed = TRUE;
                                                break;
                                        }

And 0xd4 is the offset of sched_stamp:
(gdb) p ((thread_t) 0)->sched_stamp
Cannot access memory at address 0xd4
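
The arithmetic behind that can be sketched in plain C. This is only an
illustration: struct fake_thread is a hypothetical stand-in whose padding is
invented so that sched_stamp lands at 0xd4, the offset gdb reported; it is
not the real thread structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel thread structure. Only the 0xd4
 * offset of sched_stamp comes from the gdb output; the padding is invented. */
struct fake_thread {
    unsigned char opaque[0xd4];
    uint32_t      sched_stamp;
};

static_assert(offsetof(struct fake_thread, sched_stamp) == 0xd4,
              "sched_stamp must sit at the offset gdb reported");

/* Address that a load of thread->sched_stamp touches for a given base. */
static uintptr_t sched_stamp_fault_address(uintptr_t thread_base)
{
    return thread_base + offsetof(struct fake_thread, sched_stamp);
}
```

With a NULL base, sched_stamp_fault_address(0) gives 0xd4, which matches
the DAR in the panic log.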

So it looks like the processor's idle_thread pointer got zeroed out
somehow.  The locals seem to confirm this:

(gdb) info locals
restart_needed = 9012778
thread = 0x0
pset = 0x331800
processor = 0x1239948
s = 1

The processor struct looks like this.

(gdb) p/x *processor
$9 = {processor_queue = {next = 0x337478, prev = 0x331814}, state = 0x33705c,
  active_thread = 0x0, next_thread = 0x0, idle_thread = 0x0,
  processor_set = 0x1239948, current_pri = 0x80000000, quantum_timer = {q_link = {
      next = 0x0, prev = 0x0}, func = 0x0, param0 = 0x0, param1 = 0x0,
    deadline = 0x0, state = 0x266450}, quantum_end = 0x0, last_dispatch = 0xcb18000,
  timeslice = 0x104, deadline = 0x2400000080, runq = {highq = 0x80, bitmap = {0x5f,
<...>

This looks corrupt, and judging by the pset, the processor address
itself seems funny.  E.g.:

(gdb) p *pset
$13 = {idle_queue = {next = 0x331800, prev = 0x331800}, idle_count = 0,
  active_queue = {next = 0x337000, prev = 0x33748c}, processors = {next = 0x1239948,
    prev = 0x33748c}, processor_count = 2, sched_lock = {lock_data = 19103745},
  runq = {highq = 0, bitmap = {0, 0, 0, 1}, count = 0, urgency = 0, queues = {{
        next = 0x331840, prev = 0x331840}, {next = 0x331848, prev = 0x331848}, {
<...>

E.g., all the other addresses here are in the 0x33xxxx range, so
processor = 0x1239948 seems rather suspicious.  The other one looks more
reasonable:

$15 = (struct processor *) 0x33748c
(gdb) p *(processor_t)0x33748c
$16 = {processor_queue = {next = 0x33180c, prev = 0x337000}, state = 1,
  active_thread = 0x1238000, next_thread = 0x0, idle_thread = 0x1238948,
  processor_set = 0x331800, current_pri = 95, quantum_timer = {q_link = {
      next = 0x135ad6c, prev = 0x320ed0}, func = 0x28738 <thread_quantum_expire>,
    param0 = 0x33748c, param1 = 0x1238000, deadline = 37553127975012,
    state = DELAYED}, quantum_end = 37553127975012, last_dispatch = 37553127641687,
  timeslice = 1, deadline = 18446744073709551615, runq = {highq = 0, bitmap = {0, 0,
<...>
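
The range heuristic used above can be written down as a small check. This
is a rough sketch, not a real validity test: the 64 KiB window starting at
0x330000 is an assumption picked only because the pset fields printed above
all read 0x33xxxx, and the names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed window: chosen only to match the 0x33xxxx values in the dump. */
#define PSET_REGION_BASE 0x330000u
#define PSET_REGION_SPAN 0x010000u

static_assert(PSET_REGION_SPAN != 0, "window must not be empty");

/* True when addr lies in [base, base + span). */
static bool in_region(uintptr_t addr, uintptr_t base, uintptr_t span)
{
    return addr >= base && addr - base < span;
}

/* True when addr falls in the same assumed window as the pset itself. */
static bool looks_like_pset_neighbor(uintptr_t addr)
{
    return in_region(addr, PSET_REGION_BASE, PSET_REGION_SPAN);
}
```

By this test 0x33748c passes and 0x1239948 does not, which is all the
"looks suspicious" argument amounts to.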

But a nearly identical address, 0x1238948, shows up here too (as the idle
thread); it is exactly 0x1000 below the suspicious 0x1239948.  I have no
idea what is supposed to be there.  Dumping the contents doesn't tell me
much:

(gdb) x/32 0x1238948
0x1238948 <mhp.0+15366720>:     0x00000000      0x00000000      0x00000000      0x00000000
0x1238958 <mhp.0+15366736>:     0x00000000      0x00000000      0x01238948      0x80000000
0x1238968 <mhp.0+15366752>:     0x00000000      0x00000000      0x00000000      0x00000000
0x1238978 <mhp.0+15366768>:     0x00000008      0xffffffff      0x00000000      0x0002c490
0x1238988 <mhp.0+15366784>:     0x00000000      0x00000000      0x00000000      0x0cd30000
0x1238998 <mhp.0+15366800>:     0x00000184      0x00000004      0x00000000      0x00000000
0x12389a8 <mhp.0+15366816>:     0x0000005f      0x00000051      0x00000000      0x00000000
0x12389b8 <mhp.0+15366832>:     0x00000000      0x00000000      0x0000000e      0x00000000

If anybody has any ideas, I'll leave this in gdb for a while.

Thanks,

Drew