Dual-CPU Kernel Panic
- Subject: Dual-CPU Kernel Panic
- From: "Stockhoff, Cory" <email@hidden>
- Date: Thu, 20 Mar 2003 08:50:21 -0800
Hello Everybody,
I apologize for the length of this post, but I hoped to provide enough
info to give somebody an "aha!" moment.
Our product consists of user and kernel components implementing a VPN
client. The kernel component is a KEXT that processes inbound and outbound
IP packets intercepted by a DLIL filter; the filter is pushed onto every
network interface that becomes active.
The crash we are seeing is a kernel panic that occurs only on multiple-CPU
hosts, and is infrequent. Our KEXT tries to allocate some kernel memory by
calling "kmem_alloc()". The panic message is always "panic(cpu [0 or 1]):
thread_invoke: preemption_level 1". The most reliable way to duplicate the
problem is to engage the software (i.e. connect to a VPN switch via an
encrypted VPN tunnel), mount a remote shared volume (AppleTalk over IP), and
do heavy reads and writes to the volume.
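For context, the allocation itself is nothing exotic; it looks roughly like
the sketch below. This is a simplified illustration rather than our actual
source: the my_* names are invented, and the header paths are what I believe
the Jaguar Kernel framework uses, so treat them as assumptions.

#include <sys/types.h>
#include <mach/kern_return.h>
#include <vm/vm_kern.h>          /* kmem_alloc(), kmem_free(), kernel_map */

/*
 * Simplified sketch of how the KEXT grabs a scratch buffer while
 * processing a packet in the filter path.  The my_* names are
 * placeholders, not our real identifiers.
 */
static void *
my_alloc_scratch(vm_size_t size)
{
    vm_offset_t   addr = 0;
    kern_return_t kr;

    /* This kmem_alloc() call is what is on the stack when the panic hits. */
    kr = kmem_alloc(kernel_map, &addr, size);
    if (kr != KERN_SUCCESS)
        return NULL;

    return (void *)addr;
}

static void
my_free_scratch(void *p, vm_size_t size)
{
    if (p != NULL)
        kmem_free(kernel_map, (vm_offset_t)p, size);
}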
Our customers have seen the problem on both Jaguar (10.2.x) and Puma
(10.1.x) OSes. Our testing and debugging have been on Jaguar hosts, 10.2.3
(Darwin 6.3). Our application is built on Puma.
Various checks have been tried, none successful:
1) Don't call kmem_alloc() if the preemption level is 1 (roughly the guard
   sketched after this list). It turns out the preemption level is often 1
   during normal operation without causing a panic, but skipping the
   allocation every time the level is 1 starves our software of memory and
   causes its own problems.
2) Tried a thread_funnel_switch() (also sketched after this list). No funnel
   was being held by the thread at the time, so this caused its own panic.
3) Examined the size of the requested kernel block. The block requested
when the panic occurs does not look unusual. The requested sizes seen
were anywhere from 10 to over 1000 bytes during normal operation.
The request that caused one panic was about 700 bytes.
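For reference, the two guards from checks 1 and 2 (tried separately, not
together) looked roughly like the sketch below. get_preemption_level() is a
kernel-private routine that we resolve by linking directly against the
kernel, and KERNEL_FUNNEL/NETWORK_FUNNEL come from the old funnel API; the
exact headers and declarations are my assumptions about the Jaguar kernel,
not something I can vouch for.

#include <sys/types.h>
#include <mach/kern_return.h>
#include <vm/vm_kern.h>        /* kmem_alloc(), kernel_map */
#include <kern/thread.h>       /* thread_funnel_switch(), KERNEL_FUNNEL, NETWORK_FUNNEL */

/* Kernel-private; not a published KPI, so declared by hand here. */
extern int get_preemption_level(void);

static void *
my_alloc_scratch_guarded(vm_size_t size)
{
    vm_offset_t   addr = 0;
    kern_return_t kr;

    /*
     * Check 1: refuse to allocate while preemption is disabled.  This
     * avoids the panic, but the level is frequently 1 during normal
     * operation, so the caller is starved of memory and our software
     * misbehaves.
     */
    if (get_preemption_level() > 0)
        return NULL;

    /*
     * Check 2: switch from the kernel funnel to the network funnel
     * before a call that may block.  The DLIL callback held no funnel
     * at this point, so this call produced its own panic.
     */
    (void) thread_funnel_switch(KERNEL_FUNNEL, NETWORK_FUNNEL);

    kr = kmem_alloc(kernel_map, &addr, size);
    return (kr == KERN_SUCCESS) ? (void *)addr : NULL;
}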
The stack trace can vary a bit, but this one is typical (and from Jaguar):
0x000856cc print_backtrace+176
0x00085afc Debugger+108
0x000287a8 panic+488
0x00033eec thread_invoke+72
0x000344d0 thread_block_reason+212
0x0008d51c mlInUse+16
0x00060b60 vm_fault_wire_fast+284
0x00064f88 vm_map_wire_nested+2988
0x000651a0 vm_map_wire+120
0x00061a78 kernel_memory_allocate+600
Looking through the Darwin source, the calls appear as:
- vm_fault_wire_fast
- mutex_lock(&vm_page_queue_lock)
- mlInUse
- mutex_lock_wait
- thread_sleep_interlock
- assert_wait
- interlock_unlock
- thread_block
- thread_block_reason (continuation, AST_NONE)
- thread_invoke
- panic("thread_invoke: preemption_level %d\n",
cpu_data[cpu_number()].preemption_level);
I don't quite understand how the preemption level gets set in this case, but
it looks like it is done in the "interlock_unlock" function, and that
"thread_sleep_interlock" shouldn't be calling both "interlock_unlock" and
"thread_block".
Whether or not I'm correct, is there a workaround for this panic that can be
implemented in my code? Or, if it's an OS problem, will this be fixed in a
future release (or is it already fixed in 10.2.4)?
Cheers,
Cory