Delivered-To: darwin-kernel@lists.apple.com

On Jan 17, 2006, at 1:37 PM, James Reynolds wrote:

> So it has been a few months since I read that. Rereading it after having read a lot of other stuff helps a little more. But in reality, it raises more questions and answers only the one I had about what PC means.

I'm sorry that this isn't easy. Sadly, it's not just understanding the backtraces that's complicated; they're not the only barrier to entry for this particular game. The good news is that practice is all it takes, along with a willingness to keep learning.

> But only slightly, as I'm still not sure exactly what it means, but I think I have found another bug. In examining 4 different machine panic logs from the last month, I found they are very similar.
>
> Anyway, 2 machines have nearly identical logs. The DAR value, R1 value, and exception states (sv) are the only things that differ; the PC and backtraces are the same:

Let's see if going over this backtrace here helps at all.

> Thu Dec 15 20:03:23 2005
> Unresolved kernel trap(cpu 0): 0x600 - Alignment DAR=0x0000000001BCBEAE PC=0x00000000000A4F20

We took an alignment exception attempting to access the address 0x0000000001BCBEAE. It's only 16-bit aligned, so it was probably a 32- or 64-bit access. The instruction that tried to do this was at 0x00000000000A4F20. In addition, we took this while we were in the kernel.

> Latest crash info for cpu 0:
>    Exception state (sv=0x23E24500)
>       PC=0x000A4F20; MSR=0x00009030; DAR=0x01BCBEAE; DSISR=0x0A000000; LR=0x000750C0; R1=0x0CF13D00; XCP=0x00000018 (0x600 - Alignment)
>       Backtrace:
>          0x0002A138 0x00037960 0x00265D14 0x00265F50 0x00265E30 0x002A8494

You can convert these back into function names using gdb. Here I am doing it against the wrong kernel version:

    msmith% gdb /mach_kernel
    (gdb) x/i 0x0002A138
    0x2a138 <ipc_task_enable+56>:   lwz     r0,88(r1)
    (gdb) x/i 0x00037960
    0x37960 <task_set_64bit+68>:    lwz     r2,20(r31)
    (gdb)
    0x37964 <task_set_64bit+72>:    lwz     r3,44(r2)
    (gdb) x/i 0x00265D14
    0x265d14 <waitid+260>:  addi    r3,r1,64

etc.

Again, this was against the wrong kernel version, so don't take the above as gospel. If you grab the debug kernel and the Darwin sources, you can get line number information and perhaps track it down to a small fragment of code.

Once you've got a likely suspect, you're getting somewhere. Once you know what's failing, you need to look for causes. In your case here, where you see the same specific failure repeating over and over, the cause is likely to be something with a deterministic consequence. You can rule out things like random memory corruption; instead you're looking for something with a small but nonzero chance of happening. Often you'll be looking for a race condition or degenerate behaviour in the face of an unexpected resource shortage.

For the specific example here, you almost certainly have a bad data or function pointer. Once you've found the code in question, you need to look at which pointer(s) it's attempting to dereference, and then at who might have changed them recently. One of those is likely to be your culprit.

= Mike
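The alignment reasoning earlier in this message (DAR=0x01BCBEAE being only 16-bit aligned, hence a probable 32- or 64-bit access) can be checked with a few lines of arithmetic. What follows is a minimal sketch in Python, not anything taken from the kernel or from gdb; `natural_alignment` is a hypothetical helper name introduced only for illustration, and the 8-byte cap is an assumption that covers the 16-, 32- and 64-bit cases discussed.

```python
def natural_alignment(addr: int, limit: int = 8) -> int:
    """Return the largest power-of-two byte alignment (capped at `limit`)
    that `addr` satisfies. `limit` is an illustrative assumption."""
    align = 1
    # Keep doubling while the address is still a multiple of the next size up.
    while align < limit and addr % (align * 2) == 0:
        align *= 2
    return align


# Faulting data address (DAR) copied from the panic log.
DAR = 0x0000000001BCBEAE

if __name__ == "__main__":
    print(f"DAR {DAR:#010x} is {natural_alignment(DAR)}-byte aligned")
    for width in (2, 4, 8):
        status = "aligned" if DAR % width == 0 else "misaligned"
        print(f"  {width * 8}-bit access: {status}")
```

The low byte of the address is 0xAE (binary 10101110): bit 0 is clear, so a 16-bit access would be fine, but bit 1 is set, so any 32- or 64-bit access at that address is misaligned.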