Re: Understanding cores...
- Subject: Re: Understanding cores...
- From: Terry Lambert <email@hidden>
- Date: Thu, 11 Jan 2007 15:09:26 -0800
On Jan 11, 2007, at 8:11 AM, Andreas Fink wrote:
Typically, you just have to design your code so that it either
fails safe, or it fails locally. If you end up corrupting memory,
or end up walking off a pointer into a non existent address, or an
address of something that used to be allocated, but is now freed,
you will get either memory corruption (if you happen to hit
something that's there, or if a freed area is reused for another
purpose), or you crash with a fault in kernel mode.
That's nothing new for me. Actually, I already use most of the
techniques you describe in my own code. The KEXT we are looking at,
however, was not developed by me but by a group of very experienced
BSD kernel developers. It is code that works very well (it is
actually part of the OpenBSD kernel by now), and we seem to have a
crash which happens regularly on my setup but not on others. So we
have some kind of race condition or similar which is very hard to
track down. And every time it goes down, we get 750MB of useless
core dump unless someone can give me at least a glimpse of whether
there's anything in there pointing us in the right direction. I know
this is a hard problem to track down, but any hints are welcome. In
the meantime we are continuing the hunt, improving the code along
the way. We'll find it!
My personal guess is that if it's an NKE, and the problem is only on
Intel, you are running into a byte-order issue for trapping on the way
in vs. out. Given the type of problem, I'd say that the issue is one
of the packet header length field, rather than a checksum issue, so
it's likely an IP header problem. But that's just a guess based on
what you've said so far.
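To illustrate the kind of byte-order slip guessed at above, here is a
minimal userspace sketch in C. It is not taken from the KEXT or from
xnu; the helper names ip_total_len and set_ip_total_len are
hypothetical. It assumes only that the IPv4 total-length field sits in
bytes 2 and 3 of the header in network (big-endian) order, so a raw
16-bit load happens to be right on PPC and is byte-swapped on Intel:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the IPv4 total-length field (header
 * bytes 2-3) and convert it from network to host byte order. */
static uint16_t ip_total_len(const uint8_t *hdr)
{
    uint16_t wire;
    memcpy(&wire, hdr + 2, sizeof wire);  /* avoids unaligned access */
    return ntohs(wire);
}

/* Hypothetical helper: store a host-order length back into the
 * header in network byte order. */
static void set_ip_total_len(uint8_t *hdr, uint16_t len)
{
    uint16_t wire = htons(len);
    memcpy(hdr + 2, &wire, sizeof wire);
}
```

With a header whose length bytes are 0x05 0xdc, ip_total_len() returns
1500 on either architecture. Whether a given hook sees the field
already converted on the way in or on the way out is something to
check per direction; this sketch does not settle that.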
We might also decide to boot with cpus=1 on an SMP system, to make
sure that any locking we forget to do against reentrancy won't bite
us.
Happens on single CPU as well. That has been tested. XServe G4
single CPU, XServe Dual G5, MacPro 4xIntel, MacPro forced to single
CPU, MacMini Dual CPU have all been seen crashing in the past.
Nowadays the PPCs crash less than the Intels, but it was the
opposite a while ago (we found a few other glitches on the way and
fixed them too).
We don't see any difference whether the calls are synchronous or
asynchronous, use poll or select, or run in a single thread or in
multiple threads concurrently. The last traceback gave us a hint that
it was somewhere in proto_delayed_inject(), trying to lock a mutex
(entry->domain->dom_mtx).
If this is true, then it's not a CPU reentrancy into the code race,
but it could still be a concurrency of access race, or it could be a
timer event, etc., while something is blocked. It's likely something
about persistent state being assumed not to change over a blocking
operation.
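As a rough illustration of "state assumed not to change over a
blocking operation", here is a small C sketch using pthreads. It is
not xnu code: struct entry, its generation counter, and both
functions are hypothetical. It also assumes the entry's storage stays
allocated for the duration of the call (for example, because the
caller holds a reference); if the entry itself could be freed, the
lock inside it could not be trusted either.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for shared state; not an xnu structure. */
struct entry {
    pthread_mutex_t lock;
    unsigned generation;  /* bumped whenever the entry is detached */
    bool attached;
    int payload;
};

/* Take the lock (which may block), then re-check that the entry is
 * still the one the caller saw before blocking. Returns false and
 * leaves *out untouched if the state changed in the meantime. */
static bool entry_read_if_unchanged(struct entry *e,
                                    unsigned seen_generation, int *out)
{
    bool ok = false;
    pthread_mutex_lock(&e->lock);
    if (e->attached && e->generation == seen_generation) {
        *out = e->payload;
        ok = true;
    }
    pthread_mutex_unlock(&e->lock);
    return ok;
}

/* Detach the entry and invalidate any earlier snapshots of it. */
static void entry_detach(struct entry *e)
{
    pthread_mutex_lock(&e->lock);
    e->attached = false;
    e->generation++;
    pthread_mutex_unlock(&e->lock);
}
```

The point of the generation check is that a caller who sampled the
entry before blocking learns, after waking up, that a timer or another
thread detached it, instead of carrying on with stale assumptions.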
Then we are off to do the same under Leopard. (Oh, by the way,
where can we check out the sources of the current Leopard beta so
we can verify our design?)
You can't get the sources. We generally do not release xnu sources
until after release, for reasons which should be obvious, if you
think about them a little bit. If you need to test the KEXT and
you are eligible for seeds (see <http://developer.apple.com>), you
can install one of the seed builds and load your KEXT into a binary
seed kernel for testing.
Well, I have had a binary of Leopard since WWDC 2006. But that
doesn't help for this KEXT, as some internal structures are required
to adapt it for Leopard. It loads fine but doesn't do its job. I've
been asking Apple to make a public API out of it so we are not kernel
version dependent, but so far everyone is so busy getting Leopard out
that it gets postponed all the time (I've been asking for this since
2004, including personal visits to WWDC 2005 and WWDC 2006 and WWDC
2007 and...).
From my readings of the publicly visible responses in the radars you
pointed to, it would be possible to implement the APIs yourself in
Darwin and request their inclusion in MacOS X. This would also permit
the near-term workaround of providing a replacement Darwin kernel for
use by your customers as an interim solution until your APIs or
variants end up getting adopted, or there's some other way made
available to you to do the same thing. Anything else to do with those
radars probably needs handling in the context of those radars
themselves.
-- Terry