Re: WoMonitor "Failed to contact ..."

Subject: Re: WoMonitor "Failed to contact ..."
From: Chuck Hill <email@hidden>
Date: Sat, 03 Aug 2013 09:01:52 -0700
On 2013-08-03, at 3:09 AM, Philippe Rabier wrote:

> Hi Chuck,
>
> I'm going to investigate a little, because there is something I don't understand: the monitor communicates only with wotaskd, and wotaskd could tell the monitor: "instance X seems to be dead".
>
> I'm going to take some time to read the code.


Thank you.  That has been on my To Do (i.e. I will never get the time to do) list for a while.  It always seemed like something that could be improved.


> Anyway.
>
> About the lock issue, there is only one EOF stack. All threads were blocked at the same line. We handle millions of requests per day that use ERXEnterpriseObjectCache.
>
> I still have the dump if you want it.

I would like to have a look at it.


Chuck



> On 2 Aug 2013, at 17:28, Chuck Hill <email@hidden> wrote:
>
>> Hi Philippe,
>>
>> A deadlocked instance or an instance that is very slow in responding can cause this message.  I have never tracked it down in the code, but I think it is just the wrong message (some code is catching a timeout exception and reporting it as this).  It really means "Failed to get a response from an instance on ....".  A deadlocked instance is the first thing that I check for when I see this.
>>
>> For your deadlock:
>>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>>    at er.extensions.eof.ERXEnterpriseObjectCache.cache(ERXEnterpriseObjectCache.java:380)
>>
>> Do you have more than one EOF stack (Object Store Co-ordinator)?  I have a deadlock to investigate related to that.
>>
>>
>> Chuck
>>
>>
>> On 2013-08-02, at 7:49 AM, Philippe Rabier wrote:
>>
>>> Hi All,
>>>
>>> I'm resurrecting this discussion again ;-)
>>>
>>> Today we had the same "Failed to contact..." symptom, and it was persistent. We have had this problem in the past, but rarely.
>>>
>>> After googling "Failed to contact..." I found Kieran's email, and we got the same result when executing the following command:
>>> ibabar:~ admin$ sudo lsof -i tcp | grep CLOSE_WAIT
>>> java      34524   _appserver  137u  IPv6 0x171e9344      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:58973 (CLOSE_WAIT)
>>> java      34524   _appserver  138u  IPv6 0x2148f5a8      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59191 (CLOSE_WAIT)
>>> java      34524   _appserver  140u  IPv6 0x2141d344      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59070 (CLOSE_WAIT)
>>> java      34524   _appserver  144u  IPv6 0x2e28c984      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59114 (CLOSE_WAIT)
>>> java      34524   _appserver  145u  IPv6 0x2db8bb2c      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59074 (CLOSE_WAIT)
>>> java      34524   _appserver  146u  IPv6 0x13509a70      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:58845 (CLOSE_WAIT)
>>> java      34524   _appserver  152u  IPv6 0x214440e0      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:58853 (CLOSE_WAIT)
>>> java      34524   _appserver  158u  IPv6 0x2db23400      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59155 (CLOSE_WAIT)
>>> java      34524   _appserver  176u  IPv6 0x2e23b19c      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59034 (CLOSE_WAIT)
>>> java      34524   _appserver  178u  IPv6 0x2102f8c8      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59163 (CLOSE_WAIT)
>>> java      34524   _appserver  179u  IPv6 0x21523d90      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59110 (CLOSE_WAIT)
>>> java      34524   _appserver  184u  IPv6 0x20c995a8      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59199 (CLOSE_WAIT)
>>> java      34524   _appserver  187u  IPv6 0x2e1f98c8      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59042 (CLOSE_WAIT)
>>> java      34524   _appserver  190u  IPv6 0x2df27664      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59046 (CLOSE_WAIT)
>>> java      34524   _appserver  191u  IPv6 0x2dd3b4bc      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59086 (CLOSE_WAIT)
>>> java      34524   _appserver  193u  IPv6 0x2e01cf38      0t0  TCP ibabar.sophiacom.fr:dc->ibabar.sophiacom.fr:59050 (CLOSE_WAIT)
>>>
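A minimal sketch of tallying the CLOSE_WAIT lines per process, assuming lsof output in the column layout shown above (PID in the second column, connection state in parentheses at the end of the line). The class and method names are hypothetical and are not part of WebObjects or Wonder.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CloseWaitCounter {

    /** Counts lines ending in "(CLOSE_WAIT)", grouped by the PID column. */
    static Map<String, Integer> countByPid(List<String> lsofLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lsofLines) {
            String trimmed = line.trim();
            if (!trimmed.endsWith("(CLOSE_WAIT)")) {
                continue; // header line or a socket in another state
            }
            String[] columns = trimmed.split("\\s+");
            if (columns.length < 2) {
                continue; // not an lsof row
            }
            counts.merge(columns[1], 1, Integer::sum); // columns[1] is the PID
        }
        return counts;
    }

    /** Reads lsof output from standard input and prints one total per PID. */
    public static void main(String[] args) {
        List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                .lines()
                .collect(Collectors.toList());
        countByPid(lines).forEach((pid, n) ->
                System.out.println("pid " + pid + ": " + n + " CLOSE_WAIT"));
    }
}
```

It could be fed with something like `sudo lsof -i tcp | java CloseWaitCounter`. A count that keeps growing for one PID points at the instance worth dumping with jstack.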
>>> After taking a thread dump, we saw that the threads were blocked as follows:
>>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>>    at er.extensions.eof.ERXEnterpriseObjectCache.cache(ERXEnterpriseObjectCache.java:380)
>>>
>>> My question is about the cause of the CLOSE_WAITs and JavaMonitor: why is the monitor unable to contact wotaskd when a single instance is locked? I presume it is because wotaskd is unable to contact that instance.
>>>
>>> I am resurrecting this mail because it is a good tip for anyone who gets the "Failed to contact..." message in the monitor.
>>>
>>> Cheers,
>>>
>>> Philippe
>>>
>>> On 30 Apr 2009, at 23:30, Kieran Kelleher wrote:
>>>
>>>> Resurrecting this old discussion again :-(
>>>>
>>>> OK, a while ago, one xserve "omega" (running Leopard Server 10.5.6, WO 5.4.X wotaskd with fully embedded WO 5.3.3 apps) showed up in WOMonitor as Failed to Contact again. Remember WOMonitor is running on Tiger Server 10.4.8 with the wotaskd from WO 5.3.3.
>>>>
>>>> Rather than assume this is a wotaskd/networking problem this time, I decided to check the WO apps on that server "192.168.3.154" using lsof and jstack to see if I can find anything unusual and I did:
>>>>
>>>> OK, 192.168.3.154 has 2 apps running on it: pid-479 (port 2001) and pid-947 (port 2004). Also, wotaskd is running as pid 43.
>>>>
>>>> app pid-479 lsof -i tcp:2001 shows nothing unusual
>>>> COMMAND PID       USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
>>>> java    479 _appserver    7u  IPv6 0x830bb2c      0t0  TCP [::192.168.3.154]:dc (LISTEN)
>>>>
>>>> app pid-947 has unusual output, lsof -i tcp:2004 reveals 256 CLOSE_WAITs!!! .... this app is not allowing logins
>>>> http://67.78.26.66:81/~kieran/misc/lsof_tcp_2004_pid_43.txt
>>>>
>>>> BTW, the other IP 192.168.3.149 shown on the CLOSE_WAIT lines is the machine that is running WOMonitor/apache, so this would seem to indicate a lot of hung requests? (that's a question, Chuck ;-) )
>>>>
>>>> lsof for wotaskd itself gives this, which doesn't seem unusual
>>>> bash-3.2# lsof -i tcp:1085
>>>> COMMAND PID       USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
>>>> java     43 _appserver    8u  IPv6 0x6e1d258      0t0  TCP [::192.168.3.154]:webobjects (LISTEN)
>>>> java     43 _appserver   11u  IPv6 0x830b664      0t0  TCP [::192.168.3.154]:webobjects->[::192.168.3.154]:49665 (ESTABLISHED)
>>>> java     43 _appserver   12u  IPv6 0x8e41cd4      0t0  TCP [::192.168.3.154]:webobjects->[::192.168.3.154]:53449 (ESTABLISHED)
>>>> java    479 _appserver   10u  IPv6 0x830b8c8      0t0  TCP [::192.168.3.154]:49665->[::192.168.3.154]:webobjects (ESTABLISHED)
>>>> java    947 _appserver   10u  IPv6 0x8e7d344      0t0  TCP [::192.168.3.154]:53449->[::192.168.3.154]:webobjects (ESTABLISHED)
>>>>
>>>>
>>>> Now looking at the jstack outputs, we also have more useful clues.
>>>>
>>>> jstack on the pid-947 (port 2004) app reveals it has session store deadlocks!! This is the same app with all the CLOSE_WAITs
>>>> http://67.78.26.66:81/~kieran/misc/jstack_pid_947.txt
>>>>
>>>> So, it would seem that the stupid "Failed to contact" stuff I have been seeing is really caused by Session Store deadlocks. So, the first thing I am going to do now is turn OFF concurrent request handling and turn ON Wonder Session Store Deadlock detection for this app. However, I would wager that I will not see any Session Store deadlocks with concurrent request handling turned off!
>>>>
>>>> Any ideas on a strategy for deadlock detection with concurrent request handling ON?
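One possible approach, sketched here as an assumption and not as something WebObjects or Wonder provides: poll the JVM's own thread management bean from inside the instance. `ThreadMXBean.findDeadlockedThreads()` (Java 6 and later) reports threads that are stuck in a lock cycle. A stall that is not a true lock cycle would not be reported by that call, so the sketch also counts threads in the BLOCKED state as a cruder signal. The class name `BlockedThreadProbe` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class BlockedThreadProbe {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Returns the IDs of threads in a lock cycle, or an empty array if there are none. */
    static long[] deadlockedThreadIds() {
        long[] ids = THREADS.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? new long[0] : ids;
    }

    /** Counts live threads that are currently waiting to enter a monitor. */
    static int blockedThreadCount() {
        int blocked = 0;
        for (ThreadInfo info : THREADS.getThreadInfo(THREADS.getAllThreadIds())) {
            // getThreadInfo returns null entries for threads that have already exited
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadIds().length);
        System.out.println("BLOCKED threads: " + blockedThreadCount());
    }
}
```

Run periodically from a background thread, a check like this could log or raise an alert when the deadlocked list is non-empty or the BLOCKED count stays high across several polls. The thresholds and polling interval are left open because they depend on the application.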
>>>>
>>>> On Mar 25, 2009, at 10:34 PM, Chuck Hill wrote:
>>>>
>>>>>
>>>>> On Mar 25, 2009, at 7:21 PM, Kieran Kelleher wrote:
>>>>>
>>>>>> Hi again Chuck,
>>>>>>
>>>>>> If you are going to use the domain name (for example www.website.com, which resolves to 67.88.91.233 for example) doesn't that mean you have to open port 1085 on the router between the public internet and that apache/WoMonitor machine?
>>>>>
>>>>> Apache is behind the firewall.  Only ports 80 and 443 go through.
>>>>>
>>>>>
>>>>> Chuck
>>>>>
>>>>>
>>>>>
>>>>>> -Kieran
>>>>>>
>>>>>> On Mar 23, 2009, at 12:25 PM, Chuck Hill wrote:
>>>>>>
>>>>>>> On Mar 21, 2009, at 6:35 PM, Kieran Kelleher wrote:
>>>>>>>
>>>>>>>> Hi Chuck,
>>>>>>>>
>>>>>>>> Still getting this problem after a few days of running. When we last discussed this, I had updated all the WO servers that run Leopard to use an IP address for the host name. I still have not touched the only Tiger machine, which is the Apache server, runs the site's WOMonitor, and has a couple of tiny, insignificant WO apps. I am not ready to upgrade this machine to Leopard just yet, so I guess it is the next one to be updated with IP addresses instead of its Bonjour name. But I have a question for you based on your experience with this:
>>>>>>>>
>>>>>>>> - For that primary WOMonitor machine which is the main site webserver, should I change to localhost, 127.0.0.1 or the actual IP address of the machine in WOMonitor Host settings and wotaskd properties?  (FWIW, for last couple of years, we have used the Bonjour host.local name style on that machine)
>>>>>>>
>>>>>>> We usually use neither.  We use the name that DNS resolves to the primary IP on that machine (a working reverse lookup is important too).
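A minimal sketch of checking that forward and reverse lookups agree for a host name, using only `java.net.InetAddress`. The class and method names are hypothetical. The comparison ignores case and a trailing dot, and it only checks the first address returned for the name, which is an assumption that may not hold on multi-homed machines.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

class HostNameCheck {

    /** Lower-cases a host name and drops a trailing dot so names compare cleanly. */
    static String normalize(String name) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        return n.endsWith(".") ? n.substring(0, n.length() - 1) : n;
    }

    static boolean sameHostName(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    /** Resolves the name to an address, then resolves that address back to a name. */
    static boolean forwardAndReverseAgree(String hostName) throws UnknownHostException {
        InetAddress address = InetAddress.getByName(hostName); // forward lookup
        // Reverse lookup; yields the textual IP address when no name can be found.
        String reverse = address.getCanonicalHostName();
        return sameHostName(hostName, reverse);
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : InetAddress.getLocalHost().getHostName();
        System.out.println(host + " forward/reverse agree: " + forwardAndReverseAgree(host));
    }
}
```

Running it with the name entered in the WOMonitor host settings gives a quick indication of whether the reverse lookup returns the same name.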
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Chuck Hill             Senior Consultant / VP Development
>>>>>
>>>>> Practical WebObjects - for developers who want to increase their overall knowledge of WebObjects or who are trying to solve specific problems.
>>>>> http://www.global-village.net/products/practical_webobjects
>>>>>
>>>>
>>>> _______________________________________________
>>>> Do not post admin requests to the list. They will be ignored.
>>>> Webobjects-dev mailing list      (email@hidden)
>>>> Help/Unsubscribe/Update your Subscription:
>>>>
>>>> This email sent to email@hidden
>>>
>>>
>>
>> --
>> Chuck Hill
>> Executive Managing Partner, VP Development and Technical Services
>>
>> Practical WebObjects - for developers who want to increase their overall knowledge of WebObjects or who are trying to solve specific problems.
>> http://www.global-village.net/gvc/practical_webobjects
>>
>> Global Village Consulting ranks 13th in 2012 in BIV's Top 100 Fastest Growing Companies in B.C!
>>
>> Global Village Consulting ranks 44th in 25th annual PROFIT 500 ranking of Canada’s Fastest-Growing Companies by PROFIT Magazine!

--
Chuck Hill
Executive Managing Partner, VP Development and Technical Services

Practical WebObjects - for developers who want to increase their overall knowledge of WebObjects or who are trying to solve specific problems.
http://www.global-village.net/gvc/practical_webobjects

Global Village Consulting ranks 13th in 2012 in BIV's Top 100 Fastest Growing Companies in B.C!

Global Village Consulting ranks 44th in 25th annual PROFIT 500 ranking of Canada’s Fastest-Growing Companies by PROFIT Magazine!