Re: kernel lockup
- Subject: Re: kernel lockup
- From: Andreas Fink <email@hidden>
- Date: Fri, 19 Apr 2013 22:34:09 +0200
On 19.04.2013, at 22:31, Pratima Kudale <email@hidden> wrote:
> Andreas,
>
> Are you seeing this issue only on 10.8.3? Or is it reproducible on any 10.8 version?
So far only on 10.8.3. I have not tried older versions.
>
> I am sharing my experience here. We are also running into a network stack hang issue; I have bug 13138492 open for it.
I'm running into this when it panics (most of the time, however, it simply freezes and locks up):
(gdb) backtrace
#0 Debugger (message=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.22.13/osfmk/i386/AT386/model_dep.c:916
#1 0xffffff800901d626 in panic (str=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.22.13/osfmk/kern/debug.c:336
#2 0xffffff800914e542 in sa_copy () at /SourceCache/xnu/xnu-2050.22.13/bsd/net/route.c:465
#3 0xffffff800915196a in rt_lookup (lookup_only=<value temporarily unavailable, due to optimizations>, dst=0xffffff80c949bcdc, netmask=<value temporarily unavailable, due to optimizations>, rnh=<value temporarily unavailable, due to optimizations>, ifscope=0) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/route.c:2681
#4 0xffffff800914e87c in rtalloc1_common_locked (dst=0xffffff80c949bcdc, report=1, ignflags=0, ifscope=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/route.c:876
#5 0xffffff800914e7c2 in rtalloc_ign_common_locked (ro=0xffffff80c949bcd0, ignore=0, ifscope=0) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/route.c:817
#6 0xffffff800914e635 in rtalloc_ign (ro=<value temporarily unavailable, due to optimizations>, ignore=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/route.c:829
#7 0xffffff7f8967cc10 in ?? ()
#8 0xffffff7f896804bb in ?? ()
#9 0xffffff7f8964ad4e in ?? ()
#10 0xffffff7f8963eed3 in ?? ()
#11 0xffffff7f8963b0e9 in ?? ()
#12 0xffffff7f8963927a in ?? ()
#13 0xffffff7f8963e147 in ?? ()
#14 0xffffff7f8963e1b0 in ?? ()
#15 0xffffff80091ca28f in ip_proto_dispatch_in (m=0xffffff8a88c81c00, hlen=<value temporarily unavailable, due to optimizations>, proto=<value temporarily unavailable, due to optimizations>, inject_ipfref=0xffffff800914eaaf) at /SourceCache/xnu/xnu-2050.22.13/bsd/netinet/ip_input.c:663
#16 0xffffff80091ca510 in ip_input (m=0xffffff8a88c81c00) at /SourceCache/xnu/xnu-2050.22.13/bsd/netinet/ip_input.c:777
#17 0xffffff80091ca0ed in ip_proto_input (protocol=<value temporarily unavailable, due to optimizations>, packet_list=0x0) at /SourceCache/xnu/xnu-2050.22.13/bsd/netinet/ip_input.c:553
#18 0xffffff800915e26c in proto_input (protocol=<value temporarily unavailable, due to optimizations>, packet_list=0xffffff8a88c81c00) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/kpi_protocol.c:290
#19 0xffffff80091405fb in ether_inet_input (ifp=<value temporarily unavailable, due to optimizations>, protocol_family=<value temporarily unavailable, due to optimizations>, m_list=0xffffff8a88c81c00) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/ether_inet_pr_module.c:220
#20 0xffffff800913e7a1 in dlil_ifproto_input (ifproto=0xffffff80b9fd2ce0, m=0xffffff8a88c81c00) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/dlil.c:2717
#21 0xffffff80091389cc in dlil_input_packet_list_common (ifp_param=0x0, m=0xffffff8a88c81c00, cnt=<value temporarily unavailable, due to optimizations>, mode=<value temporarily unavailable, due to optimizations>, ext=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/dlil.c:2955
#22 0xffffff800913f4b9 in dlil_input_thread_func (v=0xffffff80ba99e200, w=-2000151552) at /SourceCache/xnu/xnu-2050.22.13/bsd/net/dlil.c:2823
(gdb)
>
> And I also posted kernel stack trace for hang on this mailing list in January:
> http://prod.lists.apple.com/archives/darwin-kernel/2013/Jan/msg00007.html
>
> If you are running into a similar issue: setting net.link.generic.system.flow_advisory to zero helped to get rid of the hang.
> But it affects network performance significantly. Hence, we are still waiting for an actual solution from Apple on this.
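For reference, a minimal sketch of how the workaround described above could be applied from a shell. Only the sysctl name comes from the message; the `set_flow_advisory` wrapper and its `DRY_RUN` variable are hypothetical, and by default the sketch only prints the command it would run.

```shell
# Sketch only: the sysctl name is taken from the message above;
# the wrapper function and DRY_RUN are illustrative assumptions.
SYSCTL_KEY="net.link.generic.system.flow_advisory"

set_flow_advisory() {
    value="$1"
    case "$value" in
        0|1) ;;
        *) echo "usage: set_flow_advisory 0|1" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show the command instead of changing the live system.
        echo "sysctl -w ${SYSCTL_KEY}=${value}"
    else
        sudo sysctl -w "${SYSCTL_KEY}=${value}"
    fi
}

set_flow_advisory 0
# prints: sysctl -w net.link.generic.system.flow_advisory=0
```

Running it with DRY_RUN=0 changes the live setting through sudo. A value set with sysctl -w does not survive a reboot, so the change is easy to undo while testing.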
>
> See if this helps you. Please update once you receive any solution / workaround for the issue that you are experiencing here.
>
> Thanks,
> Pratima
>
> -----Original Message-----
> From: darwin-kernel-bounces+pratima.kudale=email@hidden [mailto:darwin-kernel-bounces+pratima.kudale=email@hidden] On Behalf Of Steven Bytnar
> Sent: Friday, April 19, 2013 1:22 PM
> To: Andreas Fink
> Cc: email@hidden
> Subject: Re: kernel lockup
>
> Hi,
>
> Instead of a full core dump, how about a summary of the core dump?
> This requires the kernel debug kit, but this used to be a pretty good summary of what the machine was doing at the time of a panic. I used this with 10.5 to troubleshoot some third party software. It might need to be updated for 10.8.
>
> $ cat pd.sh
> echo Start:
> date
> echo "Working on $1"
> gdb -c "$1" -x pd.gdb > "$1.txt"
> echo End:
> date
> $ cat pd.gdb
> add-symbol-file /Volumes/KernelDebugKit/mach_kernel
> source /Volumes/KernelDebugKit/kgmacros
> showallstacks
> showallthreads
> showalltasks
> showcurrentthreads
> showcurrentstacks
> showallvm
> zprint
> quit
> $ ./pd.sh {core-file-name}
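Before starting a long gdb run it may be worth checking that all inputs are in place. A small sketch, assuming the paths used in pd.gdb above; the `missing_files` helper is hypothetical and not part of the Kernel Debug Kit.

```shell
# Print every argument that is not a readable regular file.
# Returns 0 when nothing is missing, 1 otherwise.
missing_files() {
    rc=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing: $f"
            rc=1
        fi
    done
    return "$rc"
}

# Intended use (paths from pd.gdb; "$core" is a placeholder for the core file):
#   missing_files "$core" pd.gdb \
#       /Volumes/KernelDebugKit/mach_kernel \
#       /Volumes/KernelDebugKit/kgmacros && ./pd.sh "$core"
```

With this guard, pd.sh only starts when the core file, the gdb command file, and both Kernel Debug Kit files can be read.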
>
> --Steve
>
>
> On Fri, Apr 19, 2013 at 10:02:39PM +0200, Andreas Fink wrote:
>> did that. [1]radar://13696346
>> Unfortunately the kernel coredump is too big to upload (several
>> gigabytes).
>> And now it dumps even after the reboot sometimes.
>> On 18.04.2013, at 17:51, Shantonu Sen <[2]email@hidden> wrote:
>>
>> You can use FireWire KDP if the Ethernet interfaces stop working (see
>> fwkdp(1) or the tech note on this) to attach to the kernel debugger and
>> take a core dump. Depending on the exact issue, Ethernet may work for
>> KDP even if the OS IP stack gets sad. The core dump should indicate the
>> culprit, especially if you start with a proximal symptom such as a
>> hanging process and trace the dependency chain of resources or locks.
>> Please file a Radar with the coredump
>> Shantonu
>> On Apr 18, 2013, at 7:20 AM, Andreas Fink <[3]email@hidden>
>> wrote:
>>
>> Hi Folks,
>>
>> I'm running into some kernel-related deadlocks here under 10.8.3, and
>> I cannot really figure out where to look further.
>> We have the following setup:
>>
>> Xserve with two Ethernet interfaces.
>> en0: private IPs
>> en1: public IPs
>>
>> On en1 we have several hundred open TCP sessions at times, and that's
>> where all traffic comes in and gets processed (it's the SMPP protocol).
>> The traffic is answered and processed inside our application and put
>> into a MySQL database (which is connected over en0).
>> A couple of hours later, the system "locks up". What really
>> happens is the following:
>>
>> a) you can no longer ping en1, nor do any sockets still work on it.
>> b) you can still ping en0
>> c) on en0, established sessions still work, however opening a new ssh
>> session for example doesn't work.
>> d) typing commands in a still working session most of the time locks
>> up the system. For example, "killall myapp" doesn't do anything and just
>> stalls.
>> e) syslog doesn't show anything suspicious.
>> f) my app is still in memory and runs fine
>> g) "top" was showing little CPU load, plenty of free memory. All looks
>> normal.
>> h) netstat -m was not showing any dangerous buffer overflowing.
>> i) an established remote desktop session gets killed
>> j) The application doesn't crash.
>> k) The kernel doesn't panic.
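Since the machine becomes unusable within seconds, the state reported by tools such as netstat -m is easier to examine if it is logged periodically before the lockup. A minimal sketch; the `log_snapshot` helper, the log path, and the 10-second interval are assumptions for illustration only.

```shell
# Append a timestamped header and the output of the given command to a log file.
log_snapshot() {
    logfile="$1"
    shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') $* ==="
        "$@" 2>&1
    } >> "$logfile"
}

# Intended use (not run here): sample mbuf statistics every 10 seconds.
#   while :; do
#       log_snapshot /var/tmp/lockup.log netstat -m
#       sleep 10
#   done
```

The last entries in the log then show what the counters looked like shortly before the system stopped responding.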
>>
>> I was able to run a tcpdump on the interface while this was happening
>> and what I see towards the end is that all of a sudden TCP
>> retransmissions start to pile up. We see lots and lots of them out of
>> the blue.
>> In other words, the kernel seems to stop processing the packets
>> somehow and doesn't acknowledge them to the remote anymore. Also
>> incoming acknowledgments don't get processed.
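The retransmission build-up can also be watched through the kernel's own counters. A rough sketch that extracts the retransmitted-packet count from `netstat -s -p tcp` output; the `retransmitted_packets` helper is hypothetical, and the exact wording of the counter line is an assumption based on BSD-style netstat output, so the pattern may need adjusting.

```shell
# Read `netstat -s -p tcp` output on stdin and print the number of
# retransmitted data packets (first matching line only).
# Assumed line format: "<N> data packets (<M> bytes) retransmitted".
retransmitted_packets() {
    awk '/data packets? \([0-9]+ bytes?\) retransmitted/ { print $1; exit }'
}

# Intended use (not run here):
#   netstat -s -p tcp | retransmitted_packets
```

Sampling this value a few seconds apart and comparing the numbers shows whether retransmissions are climbing.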
>> A few seconds later you can't do anything with the machine anymore and
>> you have to force reboot it over LOM (I praise Apple for implementing
>> LOM into their Xserves, even though it has its issues too).
>>
>> It is obvious that the application/traffic somehow manages to saturate
>> some kernel resource, which locks up that specific Ethernet interface
>> with a side effect on the whole kernel (like not being
>> able to load any binaries not already in memory).
>>
>> I'm a bit lost as to where to look further to analyze this issue.
>> Does anyone on this list have a hint as to what could be happening here?
>>
>> _______________________________________________
>> Do not post admin requests to the list. They will be ignored.
>> Darwin-kernel mailing list ([4]email@hidden)
>> Help/Unsubscribe/Update your Subscription:
>>
>> com
>>
>> This email sent to [6]email@hidden
>>
>> Links:
>> 1. file:///var/folders/Jw/JwJJw00g2Ra53k+1Ynt6pU+++TM/-Tmp-//radar://
>> 2. mailto:email@hidden/
>> 3. mailto:email@hidden/
>> 4. mailto:email@hidden/
>> 5.
>> 6. mailto:email@hidden/