kernel lockup
- Subject: kernel lockup
- From: Andreas Fink <email@hidden>
- Date: Thu, 18 Apr 2013 16:20:31 +0200
Hi Folks,
I'm running into some kernel-related deadlocks here under 10.8.3 and I can't really figure out where to look further.
We have the following setup:
Xserve with two Ethernet interfaces:
en0: private IPs
en1: public IPs
On en1 we have several hundred open TCP sessions at times, and that's where all traffic comes in and gets processed (it's the SMPP protocol).
The traffic is answered and processed inside our application and put into a MySQL database (which is connected over en0).
After a couple of hours, the system "locks up". What really happens is the following:
a) You can no longer ping en1, and no sockets on it work any more.
b) You can still ping en0.
c) On en0, established sessions still work, but opening a new ssh session, for example, doesn't.
d) Typing commands in a still-working session locks up the system most of the time. For example, "killall myapp" does nothing and just stalls.
e) syslog doesn't show anything suspicious.
f) My app is still in memory and runs fine.
g) "top" was showing little CPU load and plenty of free memory. All looks normal.
h) "netstat -m" was not showing any dangerous buffer exhaustion.
i) An established remote desktop session gets killed.
j) The application doesn't crash.
k) The kernel doesn't panic.
I was able to run tcpdump on the interface while this was happening, and what I see towards the end is that all of a sudden TCP retransmissions start to pile up. We see lots and lots of them out of the blue.
In other words, the kernel seems to stop processing packets somehow and no longer acknowledges them to the remote side. Incoming acknowledgments don't get processed either.
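To pin down the moment the retransmissions start, the capture could be summarized per second along the following lines. This is only a minimal sketch in Python, assuming the tcpdump output has already been parsed into simple records; the `Segment` type and the `retransmissions_per_second` function are hypothetical names made up for illustration, and treating "same flow, same sequence number, same length seen again" as a retransmission is a simplification.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """One captured TCP segment (hypothetical, simplified record)."""
    time: float   # capture timestamp in seconds
    src: str      # "ip:port" of the sender
    dst: str      # "ip:port" of the receiver
    seq: int      # TCP sequence number of the first payload byte
    length: int   # payload length in bytes


def retransmissions_per_second(segments: Iterable[Segment]) -> Counter:
    """Count payload segments already seen on the same flow, per second."""
    seen = set()
    per_second = Counter()
    for seg in segments:
        if seg.length == 0:
            continue  # pure ACKs carry no payload; skip them
        key = (seg.src, seg.dst, seg.seq, seg.length)
        if key in seen:
            # Same flow, sequence number and length again: count it
            # as a retransmission in the second it was captured.
            per_second[int(seg.time)] += 1
        else:
            seen.add(key)
    return per_second
```

A sharp jump in the per-second counts would mark the point where the kernel stops acknowledging traffic, which can then be lined up against timestamps in syslog or other monitoring output.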
A few seconds later you can't do anything with the machine anymore and you have to force-reboot it over LOM (I praise Apple for implementing LOM in their Xserves, even though it has its issues too).
It seems obvious that the application/traffic somehow manages to exhaust some kernel resource, which locks up that specific Ethernet interface, with a side effect on the whole kernel (such as not being able to load any binaries that are not already in memory).
I'm a bit lost as to where to look further to analyze this issue.
Does anyone on this list have a hint as to what could be happening here?
Darwin-kernel mailing list (email@hidden)