Hello Everybody,

I apologize for the length of this post, but I hoped to provide enough info to give somebody an "aha!" moment.

Our product consists of user and kernel components implementing a VPN client. The kernel component is a KEXT that processes inbound and outbound IP packets intercepted by a DLIL filter. The filter is pushed onto every network interface that becomes active.

The crash we are seeing is a kernel panic that occurs only on multiple-CPU hosts, and is infrequent. Our KEXT tries to allocate some kernel memory by calling kmem_alloc(). The panic message is always:

  panic(cpu [0 or 1]): thread_invoke: preemption_level 1

The most reliable way to duplicate the problem is to engage the software (i.e. connect to a VPN switch via an encrypted VPN tunnel), mount a remote shared volume (AppleTalk over IP), and do heavy reads and writes to the volume.

Our customers have seen the problem on both Jaguar (10.2.x) and Puma (10.1.x). Our testing and debugging has been on Jaguar hosts, 10.2.3 (Darwin 6.3). Our application is built on Puma.

Various workarounds have been tried, none successful:

1) Don't call kmem_alloc() if the preemption level is 1. It turns out that the preemption level is often 1 during normal operation, and is usually not a problem. However, failing the allocation every time the preemption level is 1 does cause problems for our software.

2) Tried a thread_funnel_switch(). No funnel was being held by the thread at the time, so this caused its own panic.

3) Examined the size of the requested kernel block. The block requested when the panic occurs does not look unusual. The requested sizes seen were anywhere from 10 to over 1000 bytes during normal operation; the request that caused one panic was about 700 bytes.

The stack trace can vary a bit, but this one is typical (and from Jaguar):

  0x000856cc print_backtrace+176
  0x00085afc Debugger+108
  0x000287a8 panic+488
  0x00033eec thread_invoke+72
  0x000344d0 thread_block_reason+212
  0x0008d51c mlInUse+16
  0x00060b60 vm_fault_wire_fast+284
  0x00064f88 vm_map_wire_nested+2988
  0x000651a0 vm_map_wire+120
  0x00061a78 kernel_memory_allocate+600

Looking through the Darwin source, the call sequence appears to be:

  - vm_fault_wire_fast
  - mutex(&vm_page_queue_lock)
  - mlInUse
  - mutex_lock_wait
  - thread_sleep_interlock
  - assert_wait
  - interlock_unlock
  - thread_block
  - thread_block_reason (continuation, AST_NONE)
  - thread_invoke
  - panic("thread_invoke: preemption_level %d\n", cpu_data[cpu_number()].preemption_level);

I don't quite understand how the preemption level gets set in this case, but it looks like it is done in the interlock_unlock() function, and that thread_sleep_interlock() shouldn't be calling both interlock_unlock() and thread_block().

Whether or not I'm correct, is there a workaround for this panic that can be implemented in my code? Or, if it's an OS problem, will it be fixed in a future release (or is it already fixed in 10.2.4)?

Cheers,
Cory
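P.S. One candidate workaround I have in mind, in case it helps frame the question: pre-allocate a pool of fixed-size buffers at a time when blocking is safe (e.g. in the KEXT start routine or at tunnel setup), and have the DLIL filter draw from that pool instead of calling kmem_alloc() in the packet path. Popping a buffer off a free list under a spinlock never enters the VM system and never blocks, so it should not be able to reach thread_invoke with preemption disabled. What follows is only a rough sketch; the names (pkt_pool_init, pkt_buf_get, pkt_buf_put) and the pool sizes are placeholders rather than our real code, and the IOKit allocation/lock calls are just one way to express the idea.

#include <mach/kern_return.h>   /* kern_return_t, KERN_SUCCESS */
#include <IOKit/IOLib.h>        /* IOMalloc / IOFree */
#include <IOKit/IOLocks.h>      /* IOSimpleLock */

#define POOL_BUF_SIZE   2048    /* placeholder: large enough for any request  */
#define POOL_BUF_COUNT  64      /* placeholder: tune to observed traffic load */

typedef struct pool_buf {
    struct pool_buf *next;      /* free-list link while the buffer is idle */
} pool_buf_t;

static pool_buf_t   *pool_free_list = NULL;
static IOSimpleLock *pool_lock      = NULL;

/* Call from a context where blocking is allowed (e.g. the KEXT start routine). */
kern_return_t pkt_pool_init(void)
{
    int i;

    pool_lock = IOSimpleLockAlloc();
    if (pool_lock == NULL)
        return KERN_RESOURCE_SHORTAGE;

    for (i = 0; i < POOL_BUF_COUNT; i++) {
        pool_buf_t *buf = (pool_buf_t *)IOMalloc(POOL_BUF_SIZE);
        if (buf == NULL)
            return KERN_RESOURCE_SHORTAGE;  /* cleanup of partial pool omitted */
        buf->next = pool_free_list;
        pool_free_list = buf;
    }
    return KERN_SUCCESS;
}

/* Safe to call from the DLIL filter: never blocks, may return NULL. */
void *pkt_buf_get(void)
{
    pool_buf_t *buf;

    IOSimpleLockLock(pool_lock);
    buf = pool_free_list;
    if (buf != NULL)
        pool_free_list = buf->next;
    IOSimpleLockUnlock(pool_lock);

    return (void *)buf;         /* caller must handle NULL (pool exhausted) */
}

/* Return a buffer previously obtained from pkt_buf_get(). */
void pkt_buf_put(void *p)
{
    pool_buf_t *buf = (pool_buf_t *)p;

    IOSimpleLockLock(pool_lock);
    buf->next = pool_free_list;
    pool_free_list = buf;
    IOSimpleLockUnlock(pool_lock);
}

The obvious cost is that the packet path must tolerate pkt_buf_get() returning NULL when the pool is empty (e.g. by dropping the packet), and that the pool size has to be tuned to the traffic load. Does this seem like a reasonable direction, or is there a better-supported way to allocate safely from this context?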