Re: Too many open files killed xServe from the net

Subject: Re: Too many open files killed xServe from the net
From: "Jerry W. Walker" <email@hidden>
Date: Wed, 3 Aug 2005 12:24:12 -0400

Hi, Helge,

I'm sure you will get more precise answers from those with more sysadmin experience than I have, but clearly, the server froze because it ran out of disk space.

Unix systems, in general, are very unforgiving when you exhaust their disk space. In order to keep the system more robust, I would suggest two alternatives:

* figure out what is causing the incredible amount of output to go into the webobjects.log and reduce it substantially (by orders of magnitude).

* periodically check available disk space and send out warnings if it gets too low so that files can be deleted or truncated before starving the system.

Either of these approaches will keep the system much more reliable and will keep it rebootable.
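As a rough illustration of the second suggestion, here is a minimal sketch of a free-space check that could be run from cron. It assumes Python is available on the server; the function name `low_disk_warning`, the 10% threshold, and the optional `usage` argument (there only so the check can be exercised without a real disk) are my own illustrative choices, not part of any existing setup.

```python
import shutil

def low_disk_warning(path="/", min_free_fraction=0.10, usage=None):
    """Return a warning string if free space on `path` is too low, else None."""
    if usage is None:
        # shutil.disk_usage returns (total, used, free) in bytes
        usage = shutil.disk_usage(path)
    total, _used, free = usage
    if total == 0:
        return None  # pseudo-filesystem with no capacity; nothing to report
    free_fraction = free / total
    if free_fraction < min_free_fraction:
        return (f"WARNING: {path} has only {free_fraction:.1%} free "
                f"({free} of {total} units)")
    return None

# Hypothetical usage: print a warning, or a short OK message.
# A real cron job would mail or page someone instead of printing.
print(low_disk_warning("/") or "disk space OK")
```

Run every few minutes, a check like this leaves time to truncate or delete a runaway log before the volume fills completely.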

Regarding the "kernel: file: table is full" messages, I don't understand why the app needs thousands of open files. It seems that something is failing to close files that have been used. I presume that many or most of these files are read rather than written. If so, get the app to read the data and close the file. If segments of data from several thousand files must be accessed randomly and repeatedly, it seems that much of that data should be migrated to a database where such simultaneous random access is addressed more effectively.
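To make "read the data and close the file" concrete, here is a small sketch of the pattern. The application in question is Java, where the equivalent would be closing the stream in a `finally` block; this Python version only illustrates the idea, and the function names `read_info_file` and `read_many` are hypothetical.

```python
def read_info_file(path):
    """Read one data file completely and close it before returning."""
    # The `with` block closes the file even if read() raises an exception.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()

def read_many(paths):
    """Read files one at a time, so at most one descriptor is open at once."""
    return {path: read_info_file(path) for path in paths}
```

With this shape the number of open descriptors stays constant no matter how many files are read over the life of the process.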

Beyond that, more specific sysadmin help will have to come from others on the list.

Good luck.

Regards,
Jerry

On Aug 3, 2005, at 11:57 AM, Helge Staedtler wrote:

Sorry for posting this to the dev list first, but this seems to be a problem that can only be solved by development.
Let's go:
Lately a very obscure thing happened to an xServe in our deployment. The xServe was killed (it became totally unresponsive; neither ssh nor the Admin tools worked to restart the machine) by some cause that I am still searching for.

This was also the first time I asked myself why I cannot restart an xServe via the Admin tools when at least some WebObjects apps are still working. By the way, this made me write a crontab entry that regularly checks for this situation and reacts in time to keep the machine responsive.
The facts:
*** In "/var/log/" I found the following amazing entry using "ls -lF" after we restarted the machine manually:

-rw-r--r-- 1 root wheel 34727006208 1 Aug 05:30 webobjects.log.1
*** After checking the remaining disk capacity I found:
Filesystem              512-blocks      Used   Avail Capacity  Mounted on
/dev/disk0s3             160574256 159498416  563840   100%    /
devfs                          180       180       0   100%    /dev
fdesc                            2         2       0   100%    /dev
<volfs>                       1024      1024       0   100%    /.vol
/dev/disk1s3             489963440 489963424      16   100%    /Volumes/ServerHD
automount -nsl [324]             0         0       0   100%    /Network
automount -fstab [375]           0         0       0   100%    /automount/Servers
automount -static [375]          0         0       0   100%    /automount/static

*** All disks were 100% full!
*** After immediately deleting the monster WebObjects logfile (otherwise I might have ended up unable to even boot the machine again), I checked "/var/log/system.log", which showed:

Jul 29 08:02:22 <realServerNameReplaced> last message repeated 2 times
Jul 29 08:18:36 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 64.191.227.251:1331 131.188.76.13:1433 in via en0
Jul 29 08:18:39 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 64.191.227.251:1331 131.188.76.13:1433 in via en0
Jul 29 08:38:51 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 213.138.52.133:4428 131.188.76.13:10000 in via en0
Jul 29 08:38:54 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 213.138.52.133:4428 131.188.76.13:10000 in via en0
Jul 29 10:43:53 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 132.176.163.104:3169 131.188.76.13:1433 in via en0
Jul 29 10:43:56 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 132.176.163.104:3169 131.188.76.13:1433 in via en0
Jul 29 10:56:26 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:56:54 <realServerNameReplaced> last message repeated 214 times
Jul 29 10:56:57 <realServerNameReplaced> kernel: ble is full
Jul 29 10:56:57 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:56:58 <realServerNameReplaced> last message repeated 147 times
Jul 29 10:56:59 <realServerNameReplaced> kernel: ble is full
Jul 29 10:56:59 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:57:32 <realServerNameReplaced> last message repeated 95 times
Jul 29 10:58:22 <realServerNameReplaced> last message repeated 237 times
Jul 29 10:58:22 <realServerNameReplaced> postfix/qmgr[339]: fatal: scan_dir_push: open directory incoming/0: Too many open files in system
Jul 29 10:58:22 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:58:42 <realServerNameReplaced> last message repeated 5 times
Jul 29 11:01:04 <realServerNameReplaced> last message repeated 9 times
Jul 29 11:03:20 <realServerNameReplaced> last message repeated 33 times
Jul 29 09:03:20 <realServerNameReplaced> /usr/libexec/crashreporterd: crashdump[8477] exited due to signal 5
Jul 29 11:03:22 <realServerNameReplaced> kernel: file: table is full
Jul 29 11:03:23 <realServerNameReplaced> last message repeated 4 times
Jul 29 11:03:24 <realServerNameReplaced> kernel: ull
Jul 29 11:03:24 <realServerNameReplaced> kernel: file: table is full
Jul 29 11:03:24 <realServerNameReplaced> last message repeated 198 times
Jul 29 11:03:24 <realServerNameReplaced> lookupd[243]: NetInfo connection failed for server 127.0.0.1/local
Jul 29 11:03:24 <realServerNameReplaced> kernel: file: table is full

*** and so on...
*** Digging a bit more in the logfiles, I found that some files which usually get read by one of our WebObjects apps could not be read because of this "too many open files" error.
*** Having monitored our server for 3 days now, I see that the number of open files slowly rises for one of our WebObjects apps.
*** Typing 'lsof | grep -c "java"' with root privileges brings up something like this:

java 9786 root 1307 can't read file struct from 0x05737b90
java 9786 root 1308 can't read file struct from 0x056b358c
java 9786 root 1309 can't read file struct from 0x056b3a78
java 9786 root 1310r VREG 14,5 318 1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T1 515161 0251723EF18TOP19F020F021T1.info
java 9786 root 1311 can't read file struct from 0x05739720
java 9786 root 1312r VREG 14,5 318 1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T1 515161 0251723EF18TOP19F020F021T1.info
java 9786 root 1313r VREG 14,5 318 1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T1 515161 0251723EF18TOP19F020F021T1.info
java 9786 root 1314r VREG 14,5 318 1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T1 515161 0251723EF18TOP19F020F021T1.info

*** where "9786" is the process ID of our WebObjects app, and I count about 1319 open files. The files shown here are the ones which get read regularly, and sometimes they get written. Checking the maximum number of files allowed per process using "sysctl -a" revealed:
kern.maxfilesperproc = 10240
I suppose that an increasing number of these open files may have caused "wotaskd" to write this repeatedly into the monster file until the complete standstill of the server for lack of disk space was unavoidable.
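The manual lsof check could be automated along these lines. This is only a sketch: it assumes the lsof text has already been captured (for example by a cron job), and the function names and the 80% warning fraction are hypothetical. It counts every lsof line for the process, the same rough measure as the grep above. Note too that "Too many open files in system" in the log refers to the system-wide file table, not to the per-process limit, so a real monitor would probably want to watch the system-wide total as well.

```python
from collections import Counter

def count_lsof_entries(lsof_text, command="java"):
    """Count lsof output lines per PID for one command name."""
    counts = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # lsof lines start with the columns COMMAND, PID, USER, FD
        if len(fields) >= 4 and fields[0] == command and fields[1].isdigit():
            counts[int(fields[1])] += 1
    return counts

def pids_near_limit(counts, per_process_limit, fraction=0.8):
    """Return PIDs whose entry count has reached `fraction` of the limit."""
    threshold = per_process_limit * fraction
    return sorted(pid for pid, n in counts.items() if n >= threshold)

# Three lines taken from the lsof output above, used as sample input.
sample = """\
java 9786 root 1307 can't read file struct from 0x05737b90
java 9786 root 1308 can't read file struct from 0x056b358c
java 9786 root 1309 can't read file struct from 0x056b3a78
"""
counts = count_lsof_entries(sample)
print(counts[9786])
# 10240 is the kern.maxfilesperproc value reported by sysctl above.
print(pids_near_limit(counts, per_process_limit=10240))
```

Logging the per-PID count over time would also show how quickly the number of open files grows for the affected app.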

*** NOW my question: Has anyone else experienced such behaviour? What may have caused such a complete and, from my point of view, severe breakdown? Methods to prevent this are also welcome. Does anyone have experience with how to keep an xServe at least "restartable" no matter how weird the circumstances are? (Perhaps some daemon running and listening for the ultimate restart request on the net.)

By the way, is there a maximum limit on the number of files which can be put in ONE directory? What happens if this limit is exceeded?
Any experience or helpful hint would be welcome.
Regards,
Helge

--
Jerry W. Walker, Partner
C o d e F a b, LLC - "High Performance Industrial Strength Internet Enabled Systems"
email@hidden
212 465 8484 X-102 office
212 465 9178 fax


_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


