Re: analysing a production failure
- Subject: Re: analysing a production failure
- From: Chuck Hill <email@hidden>
- Date: Thu, 15 Oct 2009 12:53:45 -0700
Hi Simon,
On Oct 15, 2009, at 12:34 PM, Simon McLean wrote:
> earlier on today javamonitor started reporting that it was unable to
> write its config to one of our app servers,
It is actually wotaskd that can't write.
> hinting that it could be a permissions issue. within minutes of me
> starting to debug it we had apps starting to fall over all over the
> shop until our entire production environment had gone down. shit.
> the server that was apparently misbehaving is one of a kind, in that
> it just runs a couple of internal admin apps, yet whatever was
> happening managed to bring down completely unrelated apps running on
> completely different servers - but all being controlled by the same
> instance of monitor... i figured in the end that the permissions
> hint was a red herring - the server concerned had simply run out of
> disk space (doh!). however sorting that out didn't fix a thing - i
> couldn't get anything back on air. (by "back on air" i mean
> accessible. all the apps would start up and claim to be running in
> monitor, but none would respond to a request - even after rebooting
> hardware - so clearly our monitor configs had bust)
> and after much swearing, desk banging and praying for forgiveness
> from the wo-gods, i ended up binning all the SiteConfig.xml and
> WOConfig.xml files from all our app servers and building the entire
> config back up from scratch. and hey presto, we were back on air.
> a couple of questions:
> 1) is it feasible that the disk space issue caused monitor to go
> bananas ?
It certainly would cause problems for wotaskd and that does cause
monitor to do strange things. Even a deadlocked app can cause monitor
problems (though I have never seen this corrupt the configuration --
happily!)
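As a rough illustration of catching this before wotaskd runs into it, here is a minimal Java sketch that checks the usable space on the volume holding the configuration files. The class name, the 100 MB threshold, and the default path are assumptions made for the example, not anything that wotaskd or JavaMonitor ships; pass your real configuration directory as the first argument.

```java
import java.io.File;

public class ConfigDiskCheck {
    // Hypothetical threshold: complain when fewer than 100 MB are usable.
    static final long MIN_FREE_BYTES = 100L * 1024 * 1024;

    /** Returns true when the volume holding dir has at least minFreeBytes usable. */
    static boolean hasEnoughSpace(File dir, long minFreeBytes) {
        // getUsableSpace() returns 0 when the path does not name a partition,
        // so a missing directory is reported as "not enough space".
        return dir.getUsableSpace() >= minFreeBytes;
    }

    public static void main(String[] args) {
        // Assumed location of the configuration directory; override via args[0].
        File configDir = new File(args.length > 0 ? args[0]
                : "/Library/WebObjects/Configuration");
        long usable = configDir.getUsableSpace();
        if (!hasEnoughSpace(configDir, MIN_FREE_BYTES)) {
            System.err.println("LOW DISK: " + usable + " bytes usable at " + configDir);
            System.exit(1);
        }
        System.out.println("OK: " + usable + " bytes usable at " + configDir);
    }
}
```

Because it exits non-zero when space is low, something like this could be run from cron on each app server and wired to whatever alerting is already in place.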
> 2) why would a problem with one host start knocking out other hosts?
Bad coding in JavaMonitor? Bad config getting sent to the other hosts
due to mishandling of an exception somewhere?
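Nobody in this thread has confirmed how wotaskd persists its configuration, so the following is not a description of its code. It is only a sketch of one common defensive pattern: write the new file under a temporary name in the same directory, then rename it over the old one, so that a failed write (for example on a full disk) leaves the previous file intact. `SafeConfigWriter` and its `write` method are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeConfigWriter {
    /**
     * Writes content to target by way of a temp file in the same directory.
     * If the write fails part way, the IOException propagates and the
     * existing target file is never touched.
     */
    static void write(Path target, String content) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Same directory as the target, so the rename stays on one file system.
        Path tmp = Files.createTempFile(dir, "config", ".tmp");
        try {
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            // Whether an existing target is replaced by an atomic move is
            // implementation specific; on POSIX systems it is a rename(2),
            // which replaces it.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up after a failed write.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A caller would use it as `SafeConfigWriter.write(Paths.get("SiteConfig.xml"), xml)` and treat any `IOException` as "keep the old config and report the error" rather than pushing a half-written file to other hosts.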
> 3) i expected monitor to forgive me once i had cleaned up the disk
> space issue, rather than just sit there in a huff ?
A man can dream... I get the sense that JavaMonitor and wotaskd are
not paragons of good coding.
> 4) single subnet + single monitor = single point of failure. i guess
> our only option is to break it up into multiple subnets and have
> multiple monitors ?
Yes, where possible.
Chuck
--
Chuck Hill Senior Consultant / VP Development
Practical WebObjects - for developers who want to increase their
overall knowledge of WebObjects or who are trying to solve specific
problems.
http://www.global-village.net/products/practical_webobjects
_______________________________________________
Webobjects-dev mailing list (email@hidden)