Re: Questions about debugging kernel panics
- Subject: Re: Questions about debugging kernel panics
- From: Mike Smith <email@hidden>
- Date: Wed, 18 Jan 2006 11:18:05 -0800
On Jan 17, 2006, at 1:37 PM, James Reynolds wrote:
So it has been a few months since I read that. Rereading it after
having read a lot of other stuff helps a little more. But in
reality, it raises more questions and answers only the one I had
about what PC means.
I'm sorry that this isn't easy. Sadly, it's not just understanding
the backtraces that's complicated; they're not the only barrier to
entry for this particular game. The good news is that practice is
all it takes, along with a willingness to keep learning.
Let's see if going over this backtrace here helps at all.
But only slightly, as I'm still not sure exactly what it means, but
I think I have found another bug. In examining 4 different machine
panic logs within the last month, they are very similar. Anyway,
2 machines have nearly identical logs. The DAR value, R1 value, and
Exception states (sv) are the only things different, but the PC and
backtraces are the same:
Thu Dec 15 20:03:23 2005
Unresolved kernel trap(cpu 0): 0x600 - Alignment
DAR=0x0000000001BCBEAE PC=0x00000000000A4F20
We took an alignment exception attempting to access the address
0x0000000001BCBEAE. It's only 16-bit aligned, so it was probably a
32- or 64-bit access. The instruction that tried to do this was at
0x00000000000A4F20. In addition, we took this while we were in the
kernel.
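To make the alignment reasoning concrete, here is a small Python sketch that checks which access widths the DAR from this log is aligned for. The function name natural_alignment is made up for illustration; only the two addresses come from the panic log.

```python
def natural_alignment(addr: int) -> int:
    """Return the largest power of two (in bytes) that divides addr."""
    if addr == 0:
        raise ValueError("address 0 is aligned to every power of two")
    # Isolating the lowest set bit gives the largest power-of-two divisor.
    return addr & -addr


dar = 0x01BCBEAE  # DAR value from the panic log above
align = natural_alignment(dar)
print(f"DAR {dar:#010x} is {align}-byte ({align * 8}-bit) aligned")

# Check the access widths an alignment exception might involve.
for width in (2, 4, 8):
    print(f"  {width * 8}-bit access aligned: {dar % width == 0}")
```

The low hex digit of the DAR is 0xE, so the address is a multiple of 2 but not of 4, which is why a 16-bit access would have been fine while a 32- or 64-bit one would not.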
Latest crash info for cpu 0:
Exception state (sv=0x23E24500)
PC=0x000A4F20; MSR=0x00009030; DAR=0x01BCBEAE;
DSISR=0x0A000000; LR=0x000750C0; R1=0x0CF13D00; XCP=0x00000018
(0x600 - Alignment)
Backtrace:
0x0002A138 0x00037960 0x00265D14 0x00265F50 0x00265E30 0x002A8494
You can convert these back into function names using gdb. Here I am
doing it against the wrong kernel version:
msmith% gdb /mach_kernel
(gdb) x/i 0x0002A138
0x2a138 <ipc_task_enable+56>: lwz r0,88(r1)
(gdb) x/i 0x00037960
0x37960 <task_set_64bit+68>: lwz r2,20(r31)
(gdb)
0x37964 <task_set_64bit+72>: lwz r3,44(r2)
(gdb) x/i 0x00265D14
0x265d14 <waitid+260>: addi r3,r1,64
etc. Again, this was against the wrong kernel version, so don't take
the above as gospel. If you grab the debug kernel and the Darwin
sources, you can get line number information and perhaps track it
down to a small fragment of code. Once you've got a likely suspect,
you're getting somewhere.
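Typing each address by hand gets tedious with several logs, so here is a rough Python sketch that turns a backtrace line into gdb commands. The helper name gdb_commands is hypothetical. It emits the same x/i command used above plus gdb's "info line", which only reports source lines if the kernel you load carries debug information.

```python
from typing import List, Optional

# Backtrace line copied from the panic log above.
backtrace = "0x0002A138 0x00037960 0x00265D14 0x00265F50 0x00265E30 0x002A8494"


def gdb_commands(trace: str, pc: Optional[str] = None) -> List[str]:
    """Build gdb commands for each backtrace address (and the PC, if given)."""
    addrs = ([pc] if pc else []) + trace.split()
    cmds = []
    for text in addrs:
        value = int(text, 16)                  # accepts the "0x" prefix
        cmds.append(f"x/i {value:#x}")         # disassemble one instruction
        cmds.append(f"info line *{value:#x}")  # source line, needs debug info
    return cmds


# Include the faulting PC so the failing instruction is looked up first.
for line in gdb_commands(backtrace, pc="0x000A4F20"):
    print(line)
```

One way to use this, assuming your gdb accepts a command file, is to save the output to a file and source it from a gdb session started against the kernel that matches the panic log.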
Once you know what's failing, you need to look for causes. In your
case here, where you see the same specific failure repeating over and
over, the cause is likely to be something with a deterministic
consequence. You can rule out things like random memory corruption;
instead you're looking for something with a small but nonzero chance
of happening. Often that will be a race condition or degenerate
behaviour in the face of an unexpected resource shortage.
For the specific example here, you almost certainly have a bad data
or function pointer. Once you've found the code in question, you
need to look at which pointer(s) it's attempting to dereference, and
then at who might have changed them recently. One of those is likely
to be your culprit.
= Mike
Darwin-kernel mailing list (email@hidden)