Too many open files killed xServe from the net

Subject: Too many open files killed xServe from the net
From: Helge Staedtler <email@hidden>
Date: Wed, 03 Aug 2005 17:57:40 +0200

Sorry for posting this to the dev list, but this seems to be a problem that
can only be solved on the development side.

Let's go:

Recently something very obscure happened to one of our deployment xServes.
The machine became totally unresponsive (neither ssh nor the Admin Tools
could be used to restart it), for a reason I am still trying to track down.

This was also the first time I asked myself why I cannot restart an xServe
via the Admin Tools while at least some WebObjects apps are still working. By
the way, this made me write a crontab entry which regularly checks for this
situation and reacts in time to keep the machine responsive.
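
The crontab entry itself is not shown in this message. Purely as an illustration, a check of that kind could look like the Python sketch below. It assumes the condition worth watching is disk usage on "/"; the function name, the 95% threshold and the injectable `usage` argument are hypothetical and not taken from the original setup.

```python
import shutil

def disk_nearly_full(path="/", max_used_fraction=0.95, usage=None):
    """Return True when the filesystem holding `path` is fuller than allowed.

    `usage` may be any object with `total` and `used` attributes; it exists
    only so the check can be exercised without touching a real disk.
    """
    if usage is None:
        # shutil.disk_usage returns a named tuple (total, used, free) in bytes.
        usage = shutil.disk_usage(path)
    return usage.used / usage.total > max_used_fraction
```

A cron job could call `disk_nearly_full("/")` every few minutes and rotate or truncate oversized logs when it returns True, well before the volume reaches 100%.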

The facts:

*** In "/var/log/" I found the following astonishing entry using "ls -lF"
after we restarted the machine manually:

-rw-r--r--     1 root  wheel  34727006208  1 Aug 05:30 webobjects.log.1

*** Checking the remaining disk capacity, I found:

Filesystem              512-blocks      Used  Avail Capacity  Mounted on
/dev/disk0s3             160574256 159498416 563840   100%    /
devfs                          180       180      0   100%    /dev
fdesc                            2         2      0   100%    /dev
<volfs>                       1024      1024      0   100%    /.vol
/dev/disk1s3             489963440 489963424     16   100%    /Volumes/ServerHD
automount -nsl [324]             0         0      0   100%    /Network
automount -fstab [375]           0         0      0   100%    /automount/Servers
automount -static [375]          0         0      0   100%    /automount/static
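
To put these numbers in proportion, the block counts can be converted to more familiar units. The following Python sketch only redoes arithmetic on figures quoted above (the 512-byte block size comes from the df header); the helper name is made up for illustration.

```python
BLOCK_SIZE = 512  # bytes per block, as stated in the df header

def blocks_to_gib(blocks, block_size=BLOCK_SIZE):
    """Convert a df block count to GiB (2**30 bytes)."""
    return blocks * block_size / 2**30

root_total_gib = blocks_to_gib(160574256)        # size of /dev/disk0s3
root_avail_mib = 563840 * BLOCK_SIZE / 2**20     # space left on /
log_gib = 34727006208 / 2**30                    # webobjects.log.1 in bytes
log_share = 34727006208 / (160574256 * BLOCK_SIZE)

print(f"/ is {root_total_gib:.2f} GiB with {root_avail_mib:.1f} MiB free")
print(f"webobjects.log.1 is {log_gib:.2f} GiB, {log_share:.0%} of /")
```

By this calculation the single logfile accounts for roughly 42% of the boot volume, while only about 275 MiB remained free.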

*** All disks were 100% full!
*** I immediately deleted the monster WebObjects logfile, because otherwise
I might have ended up unable even to boot the machine again. Then I checked
"/var/log/system.log", which showed:

Jul 29 08:02:22 <realServerNameReplaced> last message repeated 2 times
Jul 29 08:18:36 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 64.191.227.251:1331 131.188.76.13:1433 in via en0
Jul 29 08:18:39 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 64.191.227.251:1331 131.188.76.13:1433 in via en0
Jul 29 08:38:51 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 213.138.52.133:4428 131.188.76.13:10000 in via en0
Jul 29 08:38:54 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 213.138.52.133:4428 131.188.76.13:10000 in via en0
Jul 29 10:43:53 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 132.176.163.104:3169 131.188.76.13:1433 in via en0
Jul 29 10:43:56 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 132.176.163.104:3169 131.188.76.13:1433 in via en0
Jul 29 10:56:26 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:56:54 <realServerNameReplaced> last message repeated 214 times
Jul 29 10:56:57 <realServerNameReplaced> kernel: ble is full
Jul 29 10:56:57 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:56:58 <realServerNameReplaced> last message repeated 147 times
Jul 29 10:56:59 <realServerNameReplaced> kernel: ble is full
Jul 29 10:56:59 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:57:32 <realServerNameReplaced> last message repeated 95 times
Jul 29 10:58:22 <realServerNameReplaced> last message repeated 237 times
Jul 29 10:58:22 <realServerNameReplaced> postfix/qmgr[339]: fatal: scan_dir_push: open directory incoming/0: Too many open files in system
Jul 29 10:58:22 <realServerNameReplaced> kernel: file: table is full
Jul 29 10:58:42 <realServerNameReplaced> last message repeated 5 times
Jul 29 11:01:04 <realServerNameReplaced> last message repeated 9 times
Jul 29 11:03:20 <realServerNameReplaced> last message repeated 33 times
Jul 29 09:03:20 <realServerNameReplaced> /usr/libexec/crashreporterd: crashdump[8477] exited due to signal 5
Jul 29 11:03:22 <realServerNameReplaced> kernel: file: table is full
Jul 29 11:03:23 <realServerNameReplaced> last message repeated 4 times
Jul 29 11:03:24 <realServerNameReplaced> kernel: ull
Jul 29 11:03:24 <realServerNameReplaced> kernel: file: table is full
Jul 29 11:03:24 <realServerNameReplaced> last message repeated 198 times
Jul 29 11:03:24 <realServerNameReplaced> lookupd[243]: NetInfo connection failed for server 127.0.0.1/local
Jul 29 11:03:24 <realServerNameReplaced> kernel: file: table is full

*** and so on...
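
Because syslog collapses duplicates into "last message repeated N times", the visible lines understate how often the kernel complained. The Python sketch below is one way to tally such events; it assumes each "repeated" line refers to the most recent ordinary message before it, and the function name is hypothetical.

```python
import re

REPEAT = re.compile(r"last message repeated (\d+) times")

def count_message(lines, needle):
    """Count log lines containing `needle`, expanding syslog repeat markers."""
    total = 0
    previous_matched = False
    for line in lines:
        repeat = REPEAT.search(line)
        if repeat:
            # A repeat marker inherits the meaning of the last real message;
            # consecutive markers all refer to that same message.
            if previous_matched:
                total += int(repeat.group(1))
            continue
        previous_matched = needle in line
        if previous_matched:
            total += 1
    return total

sample = [
    "Jul 29 10:56:26 <realServerNameReplaced> kernel: file: table is full",
    "Jul 29 10:56:54 <realServerNameReplaced> last message repeated 214 times",
    "Jul 29 10:56:57 <realServerNameReplaced> kernel: ble is full",
    "Jul 29 10:56:57 <realServerNameReplaced> kernel: file: table is full",
    "Jul 29 10:56:58 <realServerNameReplaced> last message repeated 147 times",
]
print(count_message(sample, "file: table is full"))  # 363 for these five lines
```

Truncated fragments such as "kernel: ble is full" are deliberately not counted, so the result is a lower bound.
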
*** Digging a bit more in the logfiles, I found that some files which are
usually read by one of our WebObjects apps could not be read because of this
"too many open files" error.
*** Having monitored our server for 3 days now, I see that the number of
open files rises slowly for one of our WebObjects apps.
*** Typing 'lsof | grep "java"' with root privileges brings up something
like this (adding "-c" to grep prints only the number of matching lines):

java    9786 root 1307                                     can't read file struct from 0x05737b90
java    9786 root 1308                                     can't read file struct from 0x056b358c
java    9786 root 1309                                     can't read file struct from 0x056b3a78
java    9786 root 1310r  VREG       14,5      318  1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
java    9786 root 1311                                     can't read file struct from 0x05739720
java    9786 root 1312r  VREG       14,5      318  1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
java    9786 root 1313r  VREG       14,5      318  1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
java    9786 root 1314r  VREG       14,5      318  1049897 / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info

*** where "9786" is the process ID of our WebObjects app, for which I count
about 1319 open files. The files shown here are the ones which are read
regularly and sometimes written. Checking the maximum number of files
allowed per process using "sysctl -a" revealed:

kern.maxfilesperproc = 10240
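
Two things stand out in these figures: about 1319 descriptors is only around 13% of the per-process limit, and the same inode (1049897) shows up under several descriptors of the same process, which is what one would expect if a file is opened repeatedly without being closed. As far as I know, the kernel's "file: table is full" and postfix's "Too many open files in system" refer to the system-wide table (kern.maxfiles), whose value is not quoted here. The Python sketch below shows one way to pull both observations out of lsof output; the function names and the assumed column layout are illustrative.

```python
from collections import Counter

def open_fds_per_pid(lsof_lines):
    """Count numeric file descriptors per PID in `lsof` output lines."""
    counts = Counter()
    for line in lsof_lines:
        fields = line.split()
        # Assumed columns: COMMAND PID USER FD ...; numeric FDs look like
        # "1310r" or "1307". Wrapped continuation lines fail this test.
        if len(fields) >= 4 and fields[1].isdigit() and fields[3][0].isdigit():
            counts[int(fields[1])] += 1
    return counts

def repeated_inodes(lsof_lines):
    """Return (PID, inode) pairs that occur more than once among VREG rows."""
    seen = Counter()
    for line in lsof_lines:
        fields = line.split()
        # Assumed VREG layout: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
        if len(fields) >= 8 and fields[1].isdigit() and fields[4] == "VREG":
            seen[(int(fields[1]), fields[7])] += 1
    return {key: n for key, n in seen.items() if n > 1}

MAX_FILES_PER_PROC = 10240                    # kern.maxfilesperproc, see above
usage_fraction = 1319 / MAX_FILES_PER_PROC    # roughly 0.13
```

Running `open_fds_per_pid` on periodic lsof snapshots would show whether the count for PID 9786 keeps climbing, and `repeated_inodes` would point at the files that are being reopened.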

I suppose that the growing number of these open files may have caused
"wotaskd" to write this error repeatedly into the monster logfile, until the
server came to a complete standstill for lack of disk space.

*** NOW my question: Has anyone else experienced such behaviour? What may
have caused such a complete and, from my point of view, severe breakdown?
Methods to prevent this are welcome as well. Does anyone have experience
with how to keep an xServe at least "restartable" no matter how weird the
circumstances are (perhaps some daemon running and listening for the
ultimate restart request on the net)?

By the way, is there a limit on the number of files which can be put in ONE
directory? What happens if this limit is exceeded?

Any experience or helpful hint would be welcome.

Regards,
Helge