Re: analysing a production failure
  • Subject: Re: analysing a production failure
  • From: Pascal Robert <email@hidden>
  • Date: Thu, 15 Oct 2009 15:57:21 -0400


On 09-10-15 at 15:53, Chuck Hill wrote:

Hi Simon,


On Oct 15, 2009, at 12:34 PM, Simon McLean wrote:

earlier on today javamonitor started reporting that it was unable to write its config to one of our app servers,

It is actually wotaskd that can't write.


hinting that it could be a permissions issue. within minutes of me starting to debug it we had apps starting to fall over all over the shop until our entire production environment had gone down. shit.

the server that was apparently misbehaving is one of a kind, in that it just runs a couple of internal admin apps, yet whatever was happening managed to bring down completely unrelated apps running on completely different servers - but all being controlled by the same instance of monitor... i figured in the end that the permissions hint was a red herring - the server concerned had simply run out of disk space (doh!). however sorting that out didn't fix a thing - i couldn't get anything back on air. (by "back on air" i mean accessible. all the apps would start up and claim to be running in monitor, but none would respond to a request - even after rebooting hardware - so clearly our monitor configs had bust)

and after much swearing, desk banging and praying for forgiveness from the wo-gods, i ended up binning all the SiteConfig.xml and WOConfig.xml files from all our app servers and building the entire config back up from scratch. and hey presto, we were back on air.

a couple of questions:

1) is it feasible that the disk space issue caused monitor to go bananas ?

It certainly would cause problems for wotaskd and that does cause monitor to do strange things. Even a deadlocked app can cause monitor problems (though I have never seen this corrupt the configuration -- happily!)
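
To illustrate how a full disk can leave a config file in a bad state, here is a minimal Java sketch of a defensive write: check free space first, write to a temporary file, then rename it over the real one. This is not how wotaskd actually writes its config; the class name `SafeConfigWriter`, its methods, and the `minFreeBytes` threshold are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of wotaskd or JavaMonitor.
class SafeConfigWriter {

    /** Returns true if the volume holding dir reports at least minFreeBytes usable. */
    static boolean hasFreeSpace(Path dir, long minFreeBytes) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return store.getUsableSpace() >= minFreeBytes;
    }

    /**
     * Writes xml to a temporary file next to target, then renames it into place.
     * If the disk is too full, or the write fails part-way, the existing
     * target file is left untouched instead of being truncated.
     */
    static void writeConfig(Path target, String xml, long minFreeBytes) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        if (!hasFreeSpace(dir, minFreeBytes)) {
            throw new IOException("Refusing to write " + target
                    + ": less than " + minFreeBytes + " bytes free");
        }
        Path tmp = Files.createTempFile(dir, "config", ".tmp");
        try {
            Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
            // Whether an atomic move replaces an existing target is
            // implementation specific; it does on typical POSIX file systems.
            Files.move(tmp, target,
                    StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up after a failed write.
            Files.deleteIfExists(tmp);
        }
    }
}
```

A caller would do something like `SafeConfigWriter.writeConfig(Paths.get("SiteConfig.xml"), xml, 1024 * 1024)` and treat the `IOException` as "keep the old config and complain loudly" rather than carrying on with a half-written file.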



2) why would a problem with one host start knocking out other hosts?

Bad coding in JavaMonitor? Bad config getting sent to the other hosts due to mishandling of an exception somewhere?



3) i expected monitor to forgive me once i had cleaned up the disk space issue, rather than just sit there in a huff ?

A man can dream... I get the sense that JavaMonitor and wotaskd are not paragons of good coding.

I think Anjo fixed some bugs about that in Wonder's version of wotaskd, I just don't remember the exact details. It also automatically backs up SiteConfig.xml whenever a configuration change is made.
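
For anyone who wants the same safety net without Wonder, the backup-on-change idea can be sketched in a few lines of Java. This is only an illustration of the technique, not Wonder's actual code; the class name `ConfigBackup`, the `.bak` suffix and the timestamp pattern are all made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not taken from Wonder's wotaskd.
class ConfigBackup {

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss-SSS");

    /**
     * Copies config to a timestamped sibling file before it gets changed.
     * Returns the path of the backup, or null if there was nothing to back up.
     */
    static Path backup(Path config) throws IOException {
        if (!Files.exists(config)) {
            return null;
        }
        String name = config.getFileName() + "."
                + LocalDateTime.now().format(STAMP) + ".bak";
        Path copy = config.resolveSibling(name);
        return Files.copy(config, copy, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Calling `ConfigBackup.backup(...)` on SiteConfig.xml just before each write means a bad config can be rolled back by copying the most recent `.bak` file into place, instead of rebuilding everything from scratch.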



4) single subnet + single monitor = single point of failure. i guess our only option is to break it up into multiple subnets and have multiple monitors ?

Yes, where possible.


Chuck

--
Chuck Hill             Senior Consultant / VP Development

Practical WebObjects - for developers who want to increase their overall knowledge of WebObjects or who are trying to solve specific problems.
http://www.global-village.net/products/practical_webobjects








_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


