Re: analysing a production failure
- Subject: Re: analysing a production failure
- From: Pascal Robert <email@hidden>
- Date: Thu, 15 Oct 2009 15:57:21 -0400
On 09-10-15 at 15:53, Chuck Hill wrote:
Hi Simon,
On Oct 15, 2009, at 12:34 PM, Simon McLean wrote:
earlier on today javamonitor started reporting that it was unable
to write its config to one of our app servers,
It is actually wotaskd that can't write.
hinting that it could be a permissions issue. within minutes of me
starting to debug it we had apps starting to fall over all over the
shop until our entire production environment had gone down. shit.
the server that was apparently misbehaving is one of a kind, in
that it just runs a couple of internal admin apps, yet whatever was
happening managed to bring down completely unrelated apps running
on completely different servers - but all being controlled by the
same instance of monitor... i figured in the end that the
permissions hint was a red herring - the server concerned had
simply run out of disk space (doh!). however sorting that out
didn't fix a thing - i couldn't get anything back on air. (by "back
on air" i mean accessible. all the apps would start up and claim to
be running in monitor, but none would respond to a request - even
after rebooting hardware - so clearly our monitor configs had bust)
and after much swearing, desk banging and praying for forgiveness
from the wo-gods, i ended up binning all the SiteConfig.xml and
WOConfig.xml files from all our app servers and building the entire
config back up from scratch. and hey presto, we were back on air.
a couple of questions:
1) is it feasible that the disk space issue caused monitor to go
bananas ?
It certainly would cause problems for wotaskd and that does cause
monitor to do strange things. Even a deadlocked app can cause
monitor problems (though I have never seen this corrupt the
configuration -- happily!)
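Since the root cause here was a full disk, a check along the following lines could flag the condition before a config write is attempted. This is only a minimal sketch: the class name DiskSpaceCheck and the method hasRoomFor are hypothetical and are not part of wotaskd or JavaMonitor. It relies only on the standard java.io.File.getUsableSpace() call.

```java
import java.io.File;

public class DiskSpaceCheck {
    /**
     * Returns true if the partition holding dir reports at least
     * requiredBytes of usable space. Hypothetical helper, not wotaskd code.
     * getUsableSpace() is only a hint: another process can fill the disk
     * between this check and the actual write.
     */
    public static boolean hasRoomFor(File dir, long requiredBytes) {
        return dir.getUsableSpace() >= requiredBytes;
    }
}
```

A caller could log a warning and skip the write when hasRoomFor returns false, rather than risk leaving a truncated config file behind. The threshold to pass in is a judgment call for each deployment.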
2) why would a problem with one host start knocking out other hosts?
Bad coding in JavaMonitor? Bad config getting sent to the other
hosts due to mishandling of an exception somewhere?
3) i expected monitor to forgive me once i had cleaned up the disk
space issue, rather than just sit there in a huff ?
A man can dream... I get the sense that JavaMonitor and wotaskd are
not paragons of good coding.
I think Anjo fixed some bugs about that in Wonder's version of
wotaskd, I just don't remember the exact details. It also makes an
automatic backup of SiteConfig.xml whenever the configuration changes.
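To illustrate the general idea of backing up a config file on every change, here is a minimal sketch. It is not Wonder's actual implementation, which I have not checked. The class name ConfigBackup, the method writeWithBackup and the ".bak" suffix are all assumptions made for the example. The new contents go to a temporary file first, so a full disk makes the write fail before the live file is touched.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ConfigBackup {
    /**
     * Writes newContents to configFile and keeps the previous version
     * next to it as "<name>.bak". Hypothetical helper, not wotaskd code.
     */
    public static void writeWithBackup(Path configFile, String newContents) throws IOException {
        Path dir = configFile.toAbsolutePath().getParent();
        Path backup = dir.resolve(configFile.getFileName() + ".bak");
        Path temp = Files.createTempFile(dir, "config", ".tmp");
        try {
            // Write to a temp file first: if the disk is full, the failure
            // happens here and the live config stays as it was.
            Files.write(temp, newContents.getBytes(StandardCharsets.UTF_8));
            // Keep a copy of the current config before replacing it.
            if (Files.exists(configFile)) {
                Files.copy(configFile, backup, StandardCopyOption.REPLACE_EXISTING);
            }
            // Swap the fully written file into place.
            Files.move(temp, configFile, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

With something like this in place, recovering from a bad write means copying the ".bak" file back rather than rebuilding the whole configuration by hand.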
4) single subnet + single monitor = single point of failure. i
guess our only option is to break it up into multiple subnets and
have multiple monitors ?
Yes, where possible.
Chuck
--
Chuck Hill Senior Consultant / VP Development
Practical WebObjects - for developers who want to increase their
overall knowledge of WebObjects or who are trying to solve specific
problems.
http://www.global-village.net/products/practical_webobjects
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden