Re: Understanding cores...

Subject: Re: Understanding cores...
From: Terry Lambert <email@hidden>
Date: Thu, 11 Jan 2007 15:09:26 -0800

On Jan 11, 2007, at 8:11 AM, Andreas Fink wrote:

Typically, you just have to design your code so that it either fails safe, or it fails locally. If you end up corrupting memory, or end up walking off a pointer into a nonexistent address, or an address of something that used to be allocated but is now freed, you will get either memory corruption (if you happen to hit something that's there, or if a freed area is reused for another purpose), or you crash with a fault in kernel mode.
That's nothing new for me. Actually, most of the techniques you describe there I already use in my own code. However, the KEXT we are looking at was developed not by me but by a group of very experienced kernel developers. It is well-working code written by very experienced BSD developers (it's actually code which is part of the OpenBSD kernel by now), and we seem to have a crash which happens regularly on my setup but not on others. So we have some kind of race condition or the like, which is very hard to track down. And every time it goes down, we get 750MB of useless core dump unless someone can give me at least a glimpse of whether there's anything in there pointing us in the right direction. I know this is a hard problem to track down, but any hints are welcome. In the meantime we continue hunting, improving the code along the way. We'll find it!

My personal guess is that if it's an NKE, and the problem is only on Intel, you are running into a byte-order issue for trapping on the way in vs. out. Given the type of problem, I'd say that the issue is one of the packet header length field, rather than a checksum issue, so it's likely an IP header problem. But that's just a guess based on what you've said so far.
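To make the byte-order guess concrete, here is a minimal userland sketch in C. It is not code from the KEXT; the function names and the `IPV4_TOTLEN_OFFSET` macro are hypothetical. It reads the 16-bit Total Length field of an IPv4 header once with `ntohs()` and once without. On big-endian PowerPC both give the same answer, which is why a missing conversion can stay hidden until the code runs on Intel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohs() */

/* Byte offset of the 16-bit Total Length field in an IPv4 header (RFC 791). */
#define IPV4_TOTLEN_OFFSET 2

/* Correct on any CPU: the field is stored in network (big-endian) order,
 * so convert it to host order before using it as a length. */
static uint16_t ipv4_total_len(const uint8_t *hdr)
{
    uint16_t raw;
    memcpy(&raw, hdr + IPV4_TOTLEN_OFFSET, sizeof raw); /* no unaligned load */
    return ntohs(raw);
}

/* Hypothetical buggy variant: treats the wire bytes as a host-order value.
 * Harmless on big-endian hardware, byte-swapped on little-endian hardware. */
static uint16_t ipv4_total_len_unswapped(const uint8_t *hdr)
{
    uint16_t raw;
    memcpy(&raw, hdr + IPV4_TOTLEN_OFFSET, sizeof raw);
    return raw;
}
```

For a header starting with the bytes `45 00 00 54`, the first function returns 84 everywhere, while the second returns 84 on a big-endian machine and 0x5400 (21504) on a little-endian one. A length that is off by that much is the kind of value that sends later code walking past the end of a buffer.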

We might also decide to boot with cpus=1 on an SMP system, to make sure that any locking we forget to do against reentrancy won't bite us.
Happens on single CPU as well. That has been tested. XServe G4 single CPU, XServe Dual G5, MacPro 4xIntel, MacPro forced to single CPU, and MacMini Dual CPU have all been seen crashing in the past. Nowadays the PPCs crash less than the Intels, but it was the opposite a while ago (though we found a few other glitches on the way and fixed them too).

We don't see any difference whether the calls are synchronous or asynchronous, use poll or select, or run in a single thread or in multiple threads concurrently. The last traceback gave us a hint that it was somewhere in proto_delayed_inject(), trying to lock a mutex (entry->domain->dom_mtx).

If this is true, then it's not a CPU reentrancy into the code race, but it could still be a concurrency of access race, or it could be a timer event, etc., while something is blocked. It's likely something about persistent state being assumed not to change over a blocking operation.
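The "persistent state assumed not to change over a blocking operation" pattern can be sketched in userland C with pthreads. This is an illustration only, not the code in proto_delayed_inject(); `struct entry`, its `generation` counter, and every function name below are hypothetical. The idea is to snapshot what you depend on, drop the lock before blocking, and re-validate after reacquiring the lock instead of trusting what you saw earlier.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object shared between threads. */
struct entry {
    pthread_mutex_t lock;
    unsigned        generation; /* bumped whenever the entry is detached or reused */
    bool            attached;
    int             pending;    /* work applied to this entry */
};

/* Stand-in for any call that may sleep (allocation, I/O, waiting on a timer). */
typedef void (*blocking_fn)(struct entry *);

/* Returns true if the work was applied, false if the entry changed
 * underneath us while we were blocked. */
static bool inject_after_block(struct entry *e, blocking_fn block)
{
    pthread_mutex_lock(&e->lock);
    if (!e->attached) {
        pthread_mutex_unlock(&e->lock);
        return false;
    }
    unsigned gen = e->generation;     /* snapshot the state we rely on */
    pthread_mutex_unlock(&e->lock);   /* do not sleep while holding the lock */

    block(e);                         /* anything may happen to *e here */

    pthread_mutex_lock(&e->lock);
    bool ok = e->attached && e->generation == gen;  /* re-validate */
    if (ok)
        e->pending++;
    pthread_mutex_unlock(&e->lock);
    return ok;
}

/* Two hypothetical "blocking" calls for demonstration: one during which
 * nothing changes, and one during which another party detaches the entry. */
static void block_quietly(struct entry *e) { (void)e; }

static void block_with_detach(struct entry *e)
{
    pthread_mutex_lock(&e->lock);
    e->attached = false;
    e->generation++;
    pthread_mutex_unlock(&e->lock);
}
```

Note that the sketch only detects a changed entry. It does not keep the entry's memory alive across the block; if the object can be freed in the meantime, a reference count (or an equivalent lifetime guarantee) is also needed, otherwise the relock itself touches freed memory.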

Then we are off to do the same under Leopard. (Oh, by the way, where can we check out the sources of the current Leopard beta so we can verify our design?)
You can't get the sources. We generally do not release xnu sources until after release, for reasons which should be obvious, if you think about them a little bit. If you need to test the KEXT and you are eligible for seeds (see <http://developer.apple.com>), you can install one of the seed builds and load your KEXT into a binary seed kernel for testing.
Well, I have had a binary of Leopard since WWDC 2006. But that doesn't help for this KEXT, as some internal structures are required to adapt it for Leopard. It loads fine but doesn't do its job. I've been asking Apple to make a public API out of it so we are not kernel-version dependent, but so far everyone is too busy getting Leopard out, so it gets postponed all the time (I've been asking for this since 2004, including personal visits to WWDC 2005 and WWDC 2006 and WWDC 2007 and...).

From my reading of the publicly visible responses in the radars you pointed to, it would be possible to implement the APIs yourself in Darwin and request their inclusion in MacOS X. This would also permit the near-term workaround of providing a replacement Darwin kernel for use by your customers as an interim solution until your APIs or variants end up getting adopted, or there's some other way made available to you to do the same thing. Anything else to do with those radars probably needs handling in the context of those radars themselves.

-- Terry





