analysing a production failure
- Subject: analysing a production failure
- From: Simon McLean <email@hidden>
- Date: Thu, 15 Oct 2009 20:34:37 +0100
earlier today javamonitor started reporting that it was unable to write its config to one of our app servers, hinting that it could be a permissions issue. within minutes of me starting to debug it, apps started falling over all over the shop until our entire production environment had gone down. shit.
the server that was apparently misbehaving is one of a kind, in that it just runs a couple of internal admin apps, yet whatever was happening managed to bring down completely unrelated apps running on completely different servers - but all being controlled by the same instance of monitor... i figured in the end that the permissions hint was a red herring - the server concerned had simply run out of disk space (doh!). however sorting that out didn't fix a thing - i couldn't get anything back on air. (by "back on air" i mean accessible. all the apps would start up and claim to be running in monitor, but none would respond to a request - even after rebooting hardware - so clearly our monitor configs had bust)
after much swearing, desk banging and praying for forgiveness from the wo-gods, i ended up binning all the siteconfig.xml and woconfig.xml files from all our app servers and building the entire config back up from scratch. and hey presto, we were back on air.
a couple of questions:
1) is it feasible that the disk space issue caused monitor to go bananas?
2) why would a problem with one host start knocking out other hosts?
3) i expected monitor to forgive me once i had cleaned up the disk space issue, rather than just sit there in a huff. why didn't it recover?
4) single subnet + single monitor = single point of failure. i guess our only option is to break it up into multiple subnets and have multiple monitors?
thanks, simon
_______________________________________________
Webobjects-dev mailing list (email@hidden)