Re: Debugging a machine check panic

Subject: Re: Debugging a machine check panic
From: Terry Lambert <email@hidden>
Date: Thu, 12 Feb 2009 21:20:00 -0800

On Feb 12, 2009, at 2:46 PM, Ryan McGann wrote:

I have an NKE that intermittently, but reliably, panics in what looks like an interrupt handler:

	panic(cpu 0 caller 0x001A8Fd4): Machine-check at 0x0014329a, thread: 0x3639790, code:0x12, registers:
	CR0: 0x8001003b, CR2: 0x37d0bfec, CR3: 0x0107d000, CR4: 0x0000006e0
	EAX: 0x00522d08, EBX: 0x00000000, ECX: 0x00000000, EDX: 0x00001000
	ESP: 0x37d0bff0, EBP: 0x37d0c068, ESI: 0x00000000, EDI: 0x00522d08
	EFL: 0x00010206, EIP: 0x0019f967

I can reproduce the panic fairly easily, but I cannot see where in the code it is panicking. The EIP given in the panic log is someplace in hw_simplelock or zalloc_canblock, but never in my code (though I'm sure it's my kext causing the panic). I just don't know how to tell where it's going wrong. I don't even know what this panic indicates, since I don't know exactly what a machine check is, other than some sort of interrupt. My guess is stack overflow or stack corruption, but I could be dead wrong. The ESP and CR2 registers may be inside my kext, since showallkmods shows my kext is loaded at 0x2bc7a000, though that'd be a large jump (the kext is 800k).
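
The address guess above can be checked with plain arithmetic. The following is a minimal Python sketch, not part of the original message: it takes the register values from the panic log and the load address from showallkmods, and it assumes that "800k" means 800 KiB and that the kext occupies one contiguous range starting at its load address. The names KEXT_BASE, KEXT_SIZE and in_kext are made up for illustration.

```python
# Values copied from the panic log and the showallkmods output quoted above.
KEXT_BASE = 0x2bc7a000
KEXT_SIZE = 800 * 1024  # assumption: "800k" is taken to mean 800 KiB

REGS = {
    "ESP": 0x37d0bff0,
    "EBP": 0x37d0c068,
    "CR2": 0x37d0bfec,  # CR2 holds the address of the most recent page fault
    "EIP": 0x0019f967,
}


def in_kext(addr, base=KEXT_BASE, size=KEXT_SIZE):
    """Return True if addr lies in the half-open range [base, base + size)."""
    return base <= addr < base + size


for name, value in REGS.items():
    print(f"{name} = {value:#010x}  inside kext image: {in_kext(value)}")

print(f"kext image end = {KEXT_BASE + KEXT_SIZE:#010x}")
print(f"ESP - CR2 = {REGS['ESP'] - REGS['CR2']} bytes")
```

Under those assumptions the kext image ends at 0x2bd42000, so ESP and CR2 both fall outside it, and CR2 sits 4 bytes below ESP. That is at least consistent with a stack problem, but it is not proof of one, and the check says nothing about memory the kext allocates at run time.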

I've done all the usual debugging that I can think of, but two-machine debugging is pretty useless since the backtrace is not in my code, and printf can only tell me the last function in my code that is called, not any kernel functions I call (plus it's not as reliable as one would hope). I've commented out a lot of code to the point that it doesn't panic anymore, but also doesn't do anything useful anymore. So I'm pretty much at a dead end.

Anybody have any other ideas on where to begin, like what machine check means?


Definition:

	<http://en.wikipedia.org/wiki/Machine_Check_Exception>

...basically, it thinks you have a hardware problem. There are typically three causes for something like this:

(1) You have a hardware problem (usually, bad RAM that can't clock up fast enough - within the manufacturer's spec - after it's been clocked down during a power management operation)

(2) You have software that's stomping saved state information that, when restored, causes your hardware to barf

(3) You are trying to change the contents of MSRs, not realizing that these are adjusted dynamically by the OS for architectural reasons, and must be left alone by third parties

...I'm going to guess #2 for this one, given the information you've provided so far.


Using two-machine debugging, you can enter the kgmacros command:

	(gdb) showMCAstate

...to dump the machine check state for your CPU(s).

Then you can build a copy of mcelog from:

	<http://freshmeat.net/projects/mcelog/>

...and shovel the register data through there, and hope it tells you something Andi Kleen was able to figure out from the architecture manuals, excluding the Intel NDA'd information, so you can figure out what you're causing to be stomped.
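
For a first look before building mcelog, the top-level fields of a machine-check status register can be pulled apart by hand. This is a rough Python sketch, not what mcelog itself does: it assumes the bit positions of the architectural IA32_MCi_STATUS layout described in the Intel SDM, and the sample value is hypothetical rather than real showMCAstate output. The function name decode_mci_status is invented for illustration.

```python
# Flag bit positions in IA32_MCi_STATUS (architectural layout, Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the matching MISC register is valid
    "ADDRV": 58,  # the matching ADDR register is valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit status value into its flags and two error-code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    fields["mca_error_code"] = status & 0xFFFF
    return fields


# Hypothetical status value, made up purely to exercise the decoder.
sample = 0xB200000000000123

for name, value in decode_mci_status(sample).items():
    if isinstance(value, bool):
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:#06x}")
```

The two error-code fields still need the architecture manuals, or mcelog, to interpret.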

Alternatively, you can file a bug report and hope you can gather enough information that whoever it lands on can do this for you, or you could probably find the problem if you were to go through your NKE with a fine-tooth comb, doing a source inspection for logic errors.

-- Terry
_______________________________________________
Darwin-dev mailing list      (email@hidden)