Re: Too many open files killed Xserve from the net
- Subject: Re: Too many open files killed Xserve from the net
- From: Helge Staedtler <email@hidden>
- Date: Thu, 04 Aug 2005 11:50:50 +0200
Thanks to all of you so far.
I will do some exhaustive testing on a local dev machine by coding a few
Java lines which try to bring the system down (see below). Perhaps I will
then be able to reproduce the failure.
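Something like this minimal, untested sketch (the path is only an
example; the file has to exist first; and several instances running in
parallel would be needed to stress the system-wide table rather than
just the per-process limit):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FdExhauster {
    public static void main(String[] args) {
        // Keep references so the streams are never GC'd and closed.
        List streams = new ArrayList();
        int count = 0;
        try {
            while (true) {
                // Open the same file over and over without closing it.
                streams.add(new FileInputStream("/tmp/testfile.txt"));
                count++;
            }
        } catch (IOException e) {
            // Typically fails with "... (Too many open files)".
            System.out.println("Failed after " + count + " open files: " + e);
        }
    }
}
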
One thing I cannot rule out is that "too many open files" really also
counted the open sockets used for TCP-based connections. But that
WebObjects bug has, as far as I know, been gone since I switched to WO
5.2.3. In those cases the logfile of the application grew abnormally.
But this time that was probably not the case, because it would have left
some traces in the application logfile. I will also try some customized
logging, e.g. writing a shell script which regularly checks the size of
this webobjects.log beast and cuts it down in case of... just to save
the Xserve from being stuck again (see the sketch below).
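
Whether done as a shell script from cron or as a few lines of Java, the
check itself is tiny. A first, untested sketch in Java (the size limit
and the log path are just example values):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class LogWatchdog {
    // Example values; adjust to taste.
    private static final long LIMIT = 512L * 1024L * 1024L; // 512 MB
    private static final String LOG = "/var/log/webobjects.log";

    public static void main(String[] args) throws IOException {
        File log = new File(LOG);
        if (log.exists() && log.length() > LIMIT) {
            RandomAccessFile raf = new RandomAccessFile(log, "rw");
            try {
                raf.setLength(0); // truncate before the disk fills up
            } finally {
                raf.close();
            }
            System.out.println("Truncated " + LOG);
        }
    }
}
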
So thanks to all up to this point.
Regards,
Helge
On 04.08.2005, at 6:40, "email@hidden" <email@hidden> wrote:
> Most Unix variants let you increase the number of file descriptors. I
> did a quick google search and came up with this link:
>
> http://www.amug.org/~glguerin/howto/More-open-files.html
>
> It discusses how Java apps are limited to 256 open files at once on
> OS X, and how to fix that. This was just the first interesting-looking
> hit; if this guy proves to be giving bad advice, there are plenty more
> to look at.
>
> janine
>
> On Aug 3, 2005, at 6:01 PM, Lucas Holt wrote:
>
>> You probably hit some type of limit on the system for files.
>>
>> On Aug 3, 2005, at 11:57 AM, Helge Staedtler wrote:
>>
>>> Sorry for putting this on the dev list in the first place... but
>>> this seems to be a problem which can only be solved by development.
>>>
>>> Let's go:
>>>
>>> Lately a very obscure thing happened to one of the Xserves in our
>>> deployment. The Xserve was killed (it became totally unresponsive;
>>> neither ssh nor the Admin Tools worked to restart the machine) by
>>> some cause which I am still searching for.
>>>
>>> This was also the first time I asked myself why I cannot restart an
>>> Xserve via the Admin Tools even while at least some WebObjects apps
>>> were still working. By the way: this made me write a crontab entry
>>> which regularly checks for this situation and reacts in time to keep
>>> the machine responsive.
>>>
>>> The facts:
>>>
>>> *** In "/var/log/" I found following amazing entry using "ls -lF"
>>> after we
>>> restarted the machine manually:
>>>
>>> -rw-r--r--  1 root  wheel  34727006208  1 Aug 05:30  webobjects.log.1
>>>
>>> *** After checking the disk capacity left, I found:
>>>
>>> Filesystem               512-blocks      Used     Avail  Capacity  Mounted on
>>> /dev/disk0s3              160574256 159498416    563840      100%  /
>>> devfs                           180       180         0      100%  /dev
>>> fdesc                             2         2         0      100%  /dev
>>> <volfs>                        1024      1024         0      100%  /.vol
>>> /dev/disk1s3              489963440 489963424        16      100%  /Volumes/ServerHD
>>> automount -nsl [324]              0         0         0      100%  /Network
>>> automount -fstab [375]            0         0         0      100%  /automount/Servers
>>> automount -static [375]           0         0         0      100%  /automount/static
>>>
>>> *** All disks were 100% full!
>>> *** After immediately deleting the WebObjects monster logfile
>>> (because otherwise I might have ended up unable to even boot the
>>> machine another time...), I checked "/var/log/system.log", which
>>> showed:
>>>
>>> Jul 29 08:02:22 <realServerNameReplaced> last message repeated 2 times
>>> Jul 29 08:18:36 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 64.191.227.251:1331 131.188.76.13:1433 in via en0
>>> Jul 29 08:18:39 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 64.191.227.251:1331 131.188.76.13:1433 in via en0
>>> Jul 29 08:38:51 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 213.138.52.133:4428 131.188.76.13:10000 in via en0
>>> Jul 29 08:38:54 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 213.138.52.133:4428 131.188.76.13:10000 in via en0
>>> Jul 29 10:43:53 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 132.176.163.104:3169 131.188.76.13:1433 in via en0
>>> Jul 29 10:43:56 <realServerNameReplaced> kernel: ipfw: 65000 Deny TCP 132.176.163.104:3169 131.188.76.13:1433 in via en0
>>> Jul 29 10:56:26 <realServerNameReplaced> kernel: file: table is full
>>> Jul 29 10:56:54 <realServerNameReplaced> last message repeated 214 times
>>> Jul 29 10:56:57 <realServerNameReplaced> kernel: ble is full
>>> Jul 29 10:56:57 <realServerNameReplaced> kernel: file: table is full
>>> Jul 29 10:56:58 <realServerNameReplaced> last message repeated 147 times
>>> Jul 29 10:56:59 <realServerNameReplaced> kernel: ble is full
>>> Jul 29 10:56:59 <realServerNameReplaced> kernel: file: table is full
>>> Jul 29 10:57:32 <realServerNameReplaced> last message repeated 95 times
>>> Jul 29 10:58:22 <realServerNameReplaced> last message repeated 237 times
>>> Jul 29 10:58:22 <realServerNameReplaced> postfix/qmgr[339]: fatal: scan_dir_push: open directory incoming/0: Too many open files in system
>>> Jul 29 10:58:22 <realServerNameReplaced> kernel: file: table is full
>>> Jul 29 10:58:42 <realServerNameReplaced> last message repeated 5 times
>>> Jul 29 11:01:04 <realServerNameReplaced> last message repeated 9 times
>>> Jul 29 11:03:20 <realServerNameReplaced> last message repeated 33 times
>>> Jul 29 09:03:20 <realServerNameReplaced> /usr/libexec/crashreporterd: crashdump[8477] exited due to signal 5
>>> Jul 29 11:03:22 <realServerNameReplaced> kernel: file: table is full
>>> Jul 29 11:03:23 <realServerNameReplaced> last message repeated 4 times
>>> Jul 29 11:03:24 <realServerNameReplaced> kernel: ull
>>> Jul 29 11:03:24 <realServerNameReplaced> kernel: file: table is full
>>> Jul 29 11:03:24 <realServerNameReplaced> last message repeated 198 times
>>> Jul 29 11:03:24 <realServerNameReplaced> lookupd[243]: NetInfo connection failed for server 127.0.0.1/local
>>> Jul 29 11:03:24 <realServerNameReplaced> kernel: file: table is full
>>>
>>> *** and so on...
>>> *** Digging a bit more in the logfiles, I found out that some files
>>> which usually get read by one of our WebObjects apps could not be
>>> read because of this "too many open files" error.
>>> *** Having monitored our server for 3 days now, I notice that the
>>> number of open files slowly rises for one of our WebObjects apps.
>>> *** typing 'lsof | grep -c "java"' with root privileges gives the
>>> count; without the -c it brings up something like this:
>>>
>>> java  9786  root  1307   can't read file struct from 0x05737b90
>>> java  9786  root  1308   can't read file struct from 0x056b358c
>>> java  9786  root  1309   can't read file struct from 0x056b3a78
>>> java  9786  root  1310r  VREG  14,5  318  1049897  / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
>>> java  9786  root  1311   can't read file struct from 0x05739720
>>> java  9786  root  1312r  VREG  14,5  318  1049897  / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
>>> java  9786  root  1313r  VREG  14,5  318  1049897  / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
>>> java  9786  root  1314r  VREG  14,5  318  1049897  / -- DNA_TL012e0f02F0304505T10602e3f07F0823d8e8f0923EF102321LEFT121013T14T15151610251723EF18TOP19F020F021T1.info
>>>
>>> *** where "9786 " is the processid of our webobjects app and i
>>> register
>>> about 1319 open files. The files shown here are the ones which get
>>> read
>>> regularly and sometimes they get written. Checking the maximum number
>>> of
>>> files which are allowed per process using "sysctl -a" revealed:
>>>
>>> kern.maxfilesperproc = 10240
>>>
>>> I suppose that the growing number of these open files may have
>>> caused "wotaskd" to write this repeatedly into the monster file,
>>> until a complete standstill of the server due to missing disk space
>>> was unavoidable.
>>>
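>>> If the leak turns out to be in our own code, the first thing I will
>>> re-check is that every stream gets closed on every code path. Just a
>>> minimal sketch of the usual close-in-finally idiom (class and method
>>> names are only examples, not our real code):
>>>
>>> import java.io.FileInputStream;
>>> import java.io.IOException;
>>>
>>> public class SafeReader {
>>>     // Close in finally, so the file descriptor is released even
>>>     // when reading throws; otherwise each failed read leaks one fd.
>>>     public static void readInfoFile(String path) throws IOException {
>>>         FileInputStream in = null;
>>>         try {
>>>             in = new FileInputStream(path);
>>>             // ... read the file here ...
>>>         } finally {
>>>             if (in != null) {
>>>                 in.close();
>>>             }
>>>         }
>>>     }
>>> }
>>>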
>>> *** NOW my question: has anyone else experienced such behaviour?
>>> What may have caused such a complete and, from my point of view,
>>> severe breakdown? Methods to prevent this are welcome as well. Does
>>> anyone have experience with how to keep an Xserve at least
>>> "restartable" no matter how weird the circumstances are? (Perhaps
>>> some daemon running and listening for the ultimate restart request
>>> on the net?)
>>>
>>> By the way, is there a maximum limit on the number of files which
>>> can be put in ONE directory? What happens if this limit is exceeded?
>>>
>>> Any experience or helpful hint would be welcome.
>>>
>>> Regards,
>>> Helge
>>>
>>
>>
>> Lucas Holt
>> email@hidden
>> ________________________________________________________
>> FoolishGames.com (Jewel Fan Site)
>> JustJournal.com (Free blogging)
>> FoolishGames.net (Enemy Territory IoM site)
>>
>> Think PC.. in 2006 you can own an Apple PCintosh. Whats next, windows
>> works?
>>
>
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden