Re: Understanding cores...

Subject: Re: Understanding cores...
From: Terry Lambert <email@hidden>
Date: Thu, 11 Jan 2007 05:30:30 -0800

On Jan 10, 2007, at 11:05 PM, Andreas Fink wrote:

On 09.01.2007, at 19:28, Derek Kumar wrote:
On Jan 8, 2007, at 4:05 PM, Michael Tuexen wrote:
Well, on that system the NKE is always loaded, because it is required by the application running on that Mac Pro. In the meantime another system was set up (on different hardware) and it also crashes a lot. BTW: A lot means a couple of times per day. And yes, the cores we get (using a core dump server) are all like this (some pointed to bugs in the NKE in the past, but these could be fixed; in those cases SCTP.kext was also explicitly mentioned in the paniclog).

Any idea how to narrow down the problem?
If the EIP values in the "paniclog" register dump are identical/similar across all the crashes you've observed (note that "unresolved kernel trap" is just a generic label), and only occur when your driver is loaded, it's likely to be memory corruption as I noted previously. Is it always an EBP based access (typically a local or parameter) in idle_thread() that's causing the fault? The loop in idle_thread() briefly enables interrupts and disables them, so if you have an interrupt filter routine (that executes at interrupt context) that could be another point where corruption could occur (in addition to the saved context at the base of the thread's kernel stack I mentioned previously--corruption of the register context below the interrupt stack frame that contains the saved value of the EBP register, for instance). Unfortunately, there's no single magic bullet when it comes to identifying sources of memory corruption of this type--determining the patterns and location of corruption and binary search via logging/tracing is one approach (after carefully walking through your code to look for erroneous stores to memory, bad DMA bounds, stack overflows, etc.). I don't think page protection/debug register type schemes to trap the bad store (assuming it's not a physical mode store) would be useful here, since the register context would be very frequently accessed. Logic analyzers (very expensive) would be a last resort. The kernel trace facility (/usr/local/bin/trace -h) can tell you what events (such as interrupts and context switches) occurred on that processor, but given that it panics, you'd probably have to examine the trace buffer in memory (see xnu/bsd/kern/kdebug.c in the kernel sources for the internals of the trace facility) to extract the last few trace events.

Derek
Hello Derek,
We have now had the same crash on an Xserve dual G5, which looks like the one below. We think we have spotted the real culprit in the meantime, but it's like fishing in the dark: wild guessing and hoping to have it fixed. But as we see several crashdumps a day from different machines, we will know pretty soon if we found it or not. But we want to be sure ;-). So is there anything usable we can follow in this gdb output which could give us some hints?

Typically, you just have to design your code so that it either fails safe, or it fails locally. If you end up corrupting memory, or end up walking off a pointer into a nonexistent address, or an address of something that used to be allocated but is now freed, you will get either memory corruption (if you happen to hit something that's there, or if a freed area is reused for another purpose), or a crash with a fault in kernel mode.

In general, that's why all the documentation asks you to consider if you can write your code in user space instead of the kernel.

When you are programming in the kernel, it is the same as programming in any other kernel, or in a single address space embedded system (e.g. DOS or some other non-protected mode OS).

There are a number of techniques you can learn to prevent this biting you, or when it bites you, from biting too hard, but it's typically just a skill you have to master like any other skill. For example, if you allocate memory, make sure you create a zone for it and allocate from that. This makes the memory type-stable, so that if anything gets hosed, then your search area is limited to the code that touches the type of memory that got hosed. Another example is to scribble over freed objects before returning them to free lists, so that if you attempt to use the object after it's freed, you will end up with a very identifiable and distinctive error, rather than some random error. You can also use "corrupting frees".
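
To make the "type-stable pool plus scribble on free" idea concrete, here is a minimal user-space sketch in plain C. It is not the xnu zone allocator; struct conn, conn_alloc(), conn_free(), conn_is_poisoned(), the pool size, and the 0xDE fill byte are all made-up names and values for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_SLOTS  8
#define POISON_BYTE 0xDE	/* assumed fill pattern, chosen to stand out in a dump */

struct conn {			/* hypothetical object type */
	struct conn *next_free;	/* free-list link, only meaningful while free */
	int state;
	char name[16];
};

/* One pool per object type: a stray store into it implicates conn code only. */
static struct conn pool[POOL_SLOTS];
static struct conn *free_list;
static int pool_ready;

static void
pool_init(void)
{
	size_t i;

	for (i = 0; i < POOL_SLOTS; i++) {
		pool[i].next_free = free_list;
		free_list = &pool[i];
	}
	pool_ready = 1;
}

struct conn *
conn_alloc(void)
{
	struct conn *c;

	if (!pool_ready)
		pool_init();
	c = free_list;
	if (c == NULL)
		return NULL;		/* pool exhausted */
	free_list = c->next_free;
	memset(c, 0, sizeof(*c));	/* hand out a clean object */
	return c;
}

void
conn_free(struct conn *c)
{
	assert(c != NULL);
	memset(c, POISON_BYTE, sizeof(*c));	/* scribble before reuse */
	c->next_free = free_list;
	free_list = c;
}

/* Returns 1 if the payload fields still hold the free-time pattern. */
int
conn_is_poisoned(const struct conn *c)
{
	const unsigned char *p = (const unsigned char *)&c->state;
	const unsigned char *end = (const unsigned char *)(c + 1);

	for (; p < end; p++)
		if (*p != POISON_BYTE)
			return 0;
	return 1;
}
```

A stale reader of a freed conn now sees a state made of 0xDE bytes instead of plausible old data, and a debug check such as conn_is_poisoned() can confirm that the object was already back on the free list.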

An example of a "corrupting free" would be if you had an object pointer, and used it like this:


	objptr_t foo;

	...
	foo = new_foo();
	...
	discard_foo(foo);
	...
	foo->xxx = yyy;		/* Use after free!!!!! */

How would you change the design of this API to prevent the use after free?

The easiest way would be to change discard_foo()'s definition and use... currently, the way it's used, it's apparent that we are doing:

void
discard_foo(objptr_t ptr)
{
	...
	free(ptr);	/* or whatever free function we use */
	...
}

This is inherently unsafe; but say we change it to:

	...
	discard_foo(&foo);
	...

and define it as:

void
discard_foo(objptr_t *ptrptr)
{
	objptr_t ptr = *ptrptr;

	*ptrptr = (void *)(7);	/* characteristic illegal value */
	...
	free(ptr);	/* or whatever free function we use */
	...
}

Now if we attempt to continue to use foo, we are going to see a crash on an odd address boundary, and we know that we allocate all our things at even addresses. So now we know we are continuing to use a pointer after freeing it.

Further, since we guarantee that the pointer immediately goes out of scope, if it's a pointer in a structure, and is therefore globally visible, we are guaranteed that there won't be an interrupt routine or another thread accessing it out of the globally visible space after it's on its way to being freed, since if that happens, then whoever attempts to access it will crash.

So we get several types of safety protection, pretty much for free, just by changing how we define our API.
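
For anyone who wants to try this outside the kernel, a compilable user-space sketch of the same pattern might look like the following. The struct foo layout, the FOO_DISCARDED macro, and foo_is_discarded() are assumptions added for illustration; malloc-family calls stand in for whatever allocator you really use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {			/* hypothetical object layout */
	int xxx;
};
typedef struct foo *objptr_t;

/* Characteristic illegal value: odd, and inside the unmapped first page. */
#define FOO_DISCARDED	((objptr_t)(uintptr_t)7)

objptr_t
new_foo(void)
{
	return calloc(1, sizeof(struct foo));
}

void
discard_foo(objptr_t *ptrptr)
{
	objptr_t ptr = *ptrptr;

	assert(ptr != FOO_DISCARDED);	/* a double discard is caught here */
	*ptrptr = FOO_DISCARDED;	/* poison the caller's copy first */
	free(ptr);
}

/* Lets debug code test for the poison value without dereferencing it. */
int
foo_is_discarded(objptr_t ptr)
{
	return ptr == FOO_DISCARDED;
}
```

After discard_foo(&foo), a later foo->xxx = yyy touches an address just above 7. On a machine that does not trap misaligned accesses you would normally still expect a fault, because that address falls in the first page, which is usually left unmapped; either way the faulting address is recognizable at a glance.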

We might also decide to boot with cpus=1 on an SMP system, to make sure that any locking we forget to do against reentrancy won't bite us. This boots to multiuser with only a single CPU active. If you have race conditions, and you are seeing crashes, and you boot with a single CPU and all your problems mysteriously disappear, then it's a sure bet that you've failed to deal with some concurrency issue. Maybe you have a data structure that's used in multiple places, and you don't hold a mutex to serialize access to it. Or maybe you dereference part of a structure to get a flags value, then you perform a potentially blocking operation, someone else comes along while you are asleep and changes the flags out from under you, and then you make a bogus decision on the basis of the previous flags value, rather than its current value. Or maybe you just forgot to mark a global pointer as "volatile" and are changing its value from another thread, and since it's being used in a loop, the compiler optimized it into a register, so its value never changes as far as the code is concerned (because you failed to give the optimization hint "volatile" to the compiler to prevent that).
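
The stale flags case is easy to get wrong, so here is a small sketch of one way to avoid it, written with POSIX threads in user space rather than kernel mutexes. struct dev, DEV_READY, dev_start_if_ready(), and the two wait helpers are hypothetical names; the point is only that the flags are read again under the lock after the blocking step instead of trusting the earlier snapshot.

```c
#include <assert.h>
#include <pthread.h>

#define DEV_READY 0x1u		/* hypothetical flag bit */

struct dev {			/* hypothetical shared structure */
	pthread_mutex_t lock;	/* serializes access to flags and starts */
	unsigned flags;
	unsigned starts;
};

/* A step that may sleep; it is always called with d->lock released. */
typedef void (*blocking_fn)(struct dev *);

void
dev_init(struct dev *d, unsigned flags)
{
	pthread_mutex_init(&d->lock, NULL);
	d->flags = flags;
	d->starts = 0;
}

/* Returns 1 if the device was started, 0 if it was not ready. */
int
dev_start_if_ready(struct dev *d, blocking_fn wait_step)
{
	unsigned snapshot;
	int ready;

	pthread_mutex_lock(&d->lock);
	snapshot = d->flags;
	pthread_mutex_unlock(&d->lock);
	if (!(snapshot & DEV_READY))
		return 0;

	wait_step(d);		/* may block; flags can change meanwhile */

	pthread_mutex_lock(&d->lock);
	ready = (d->flags & DEV_READY) != 0;	/* re-read, ignore snapshot */
	if (ready)
		d->starts++;
	pthread_mutex_unlock(&d->lock);
	return ready;
}

/* Stand-ins for the blocking step, for demonstration only. */
void
dev_wait_noop(struct dev *d)
{
	(void)d;
}

void
dev_wait_then_lose_ready(struct dev *d)
{
	/* Simulates another thread clearing the flag while we were asleep. */
	pthread_mutex_lock(&d->lock);
	d->flags &= ~DEV_READY;
	pthread_mutex_unlock(&d->lock);
}
```

With dev_wait_noop the call returns 1 and counts a start; with dev_wait_then_lose_ready it returns 0, where code that reused the snapshot would have gone ahead on a flag that was no longer set.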

These are just a couple of examples of what you can do in your kernel code to make debugging problems easier, but there are lots of others.

In general, any good programming text on embedded systems or kernel programming, or even programming for non-protected mode OS's, like DOS or very old versions of MacOS, or in the Linux or FreeBSD kernel, etc., will go into more detail than I can go into here.

Then we are off to do the same under Leopard. (Oh, by the way, where can we check out the sources of the current Leopard beta so we can verify our design?)

You can't get the sources. We generally do not release xnu sources until after release, for reasons which should be obvious, if you think about them a little bit. If you need to test the KEXT and you are eligible for seeds (see <http://developer.apple.com>), you can install one of the seed builds and load your KEXT into a binary seed kernel for testing.

-- Terry