Re: Deadlocks in one of our apps
Re: Deadlocks in one of our apps
- Subject: Re: Deadlocks in one of our apps
- From: Pascal Robert <email@hidden>
- Date: Wed, 2 Jun 2010 11:51:13 -0400
Le 10-06-02 à 11:39, Chuck Hill a écrit :
On Jun 2, 2010, at 8:16 AM, Pascal Robert wrote:
Le 10-06-02 à 10:30, Chuck Hill a écrit :
That makes your code look guilty then. :-)
Funny thing is that he not really my code (eg, I didn't write it)
but this is code dated from WO 5.2. It's just that this app never
had that much traffic.
And I did try stress loading this app with JMeter, but since the
URL is changed when the long response page is called (session ID is
put back in the URL) and I don't know how to fix this, that part
was not stress loaded.
Check your long response page implementation again. Are there any
exceptions in the log that might be related?
Just to explain a bit more :
- It's a (non public) online store. When people log in, we create a
order in memory and customers add order items to the order. We
don't store anything in the DB until the payment is made with
PayFlow. When we get the response from PayFlow, we store a copy of
the order (and the items) to our Oracle db. After that, we contact
our SQL Server db (actually, a accounting system, and we send the
data to a stored procedure), and we get the invoice number produced
by the accounting system and store it in the order EO in Oracle.
So in summary :
- People login, we create a order EO, the EO is created in the
session's editing context
- People add items to the order
- They start the order payment steps
- Long response page kicks in
- We contact PayFlow to make the payment
- If the payment is succesful, we store the order in Oracle
- We create a new EO, in a different EC, for SQL Server
- We update the order EO to store the invoice number in Oracle
- We generate (FOXML, generated in a separated JVM) the invoice in
PDF
- Long response page is done, pageForResult is called
Everything is done in session.defaultEditingContext EXCEPT the SQL
Server EOs,
You are not using the session.defaultEditingContext in the long
response page, are you? I am pretty sure that is an excellent
source of deadlocks.
Hum, yes we do use in the long response page... But since
localInstanceOfObject won't let me have a copy in a new EC, what are
the options except not using the session EC?
Chuck
where we create a new EOObjectStore, create a new EOEditingContext
inside the new object store, and
EOObjectStore osc = new EOObjectStoreCoordinator();
EOEditingContext ec = new EOEditingContext(osc);
ec.lock();
try {
CommandesEcom commandeEcom = CommandesEcom.creerCommandesEcom(ec);
...
ec.saveChanges();
finally {
ec.unlock();
ec.dispose();
osc.dispose();
ec = null;
osc = null;
}
A co-worker suggested that we create a new editing context in the
long response page, and call EOUtilities.localInstanceOfObject to
have a copy of the order EO in the new EC, but the resulting EO is
null, even if the source is not.
I'd also reduce the Maximum Adaptor threads (JavaMonitor ->
Application configuration -> Application settings). 6 or 8 is
probably more than enough for this app. That will at least reduce
the size of the thread dumps. I'd also trim down the listen
queue size to 2 or 4, might as well catch this as soon as possible.
Chuck
On Jun 2, 2010, at 5:19 AM, Pascal Robert wrote:
... And going back to the physical server didn't solve anything,
I got the same deadlock this morning.
Ok, so I will move back the DB to the physical server to see if
the problem goes away.
On Jun 1, 2010, at 6:34 AM, Pascal Robert wrote:
Hum... And after I started using ERXWOLongResponsePage, I
still got a deadlock, but this time, it says that it's a
EODatabaseContext lock :
Thread t@92163: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.Object.wait() @bci=2, line=474 (Interpreted frame)
- com.webobjects.foundation.NSRecursiveLock.lock() @bci=54,
line=72 (Interpreted frame)
- com.webobjects.eoaccess.EODatabaseContext.lock() @bci=56,
line=1973 (Interpreted frame)
-
com
.webobjects
.eocontrol
.EOObjectStoreCoordinator
.addCooperatingObjectStore
(com.webobjects.eocontrol.EOCooperatingObjectStore) @bci=5,
line=130 (Interpreted frame)
-
com
.webobjects
.eoaccess
.EODatabaseChannel
.setCurrentEditingContext
(com.webobjects.eocontrol.EOEditingContext) @bci=34, line=166
(Interpreted frame)
...
We don't "manual" (eg , in code) locking at the
EODatabaseContext level.
It is possible that an odd exception in EOAccess or below is
resulting in this not getting unlocked. Joe's reply below
might be what is happening to you.
Chuck
Another thing to note if this is a long request to a database
housed in an ESX vm. We had similar problems with long
requests timing out between two systems, with one hosted by
esx 4.x. Such long requests were caught by some low level
interface muxing issue and my whole EOF stack was frozen when
the underlying db connection was lost mid-transaction. I
resolved it by moving this application off of a vm.
On May 31, 2010, at 5:33 PM, Pascal Robert <email@hidden>
wrote:
Ok, will try with ERXWOLongResponsePage since it look like
it's locking and unlocking all ECs in the thread.
There's a bunch of stuff wrong here. First, the only
actually locked thread is:
-
com
.webobjects
.eocontrol
.EOObjectStoreCoordinator
.addCooperatingObjectStore
(com.webobjects.eocontrol.EOCooperatingObjectStore) @bci=5,
line=130 (Interpreted frame)
-
com
.webobjects
.eoaccess
.EODatabaseChannel
.setCurrentEditingContext
(com.webobjects.eocontrol.EOEditingContext) @bci=34,
line=166 (Interpreted frame)
-
com
.webobjects
.eoaccess
.EODatabaseChannel
._selectWithFetchSpecificationEditingContext
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=158,
line=788 (Interpreted frame)
-
com
.webobjects
.eoaccess
.EODatabaseChannel
.selectObjectsWithFetchSpecification
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=64,
line=215 (Interpreted frame)
-
com
.webobjects
.eoaccess
.EODatabaseContext
._objectsWithFetchSpecificationEditingContext
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=219,
line=3205 (Interpreted frame)
-
com
.webobjects
.eoaccess
.EODatabaseContext
.objectsWithFetchSpecification
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=34,
line=3346 (Interpreted frame)
-
com
.webobjects
.eocontrol
.EOObjectStoreCoordinator
.objectsWithFetchSpecification
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=97,
line=539 (Interpreted frame)
-
com
.webobjects
.eocontrol
.EOEditingContext
.objectsWithFetchSpecification
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=79,
line=4114 (Interpreted frame)
-
er
.extensions
.eof
.ERXEC
.objectsWithFetchSpecification
(com.webobjects.eocontrol.EOFetchSpecification,
com.webobjects.eocontrol.EOEditingContext) @bci=72,
line=1211 (Interpreted frame)
-
com
.webobjects
.eocontrol
.EOEditingContext
.objectsWithFetchSpecification
(com.webobjects.eocontrol.EOFetchSpecification) @bci=3,
line=4500 (Interpreted frame)
-
com
.acaiq
.fondation
.acaiqCore
._Licence
.fetchLicences(com.webobjects.eocontrol.EOEditingContext,
com.webobjects.eocontrol.EOQualifier,
com.webobjects.foundation.NSArray) @bci=19, line=1062
(Interpreted frame)
-
com
.acaiq
.fondation
.acaiqCore
._Membre.licences(com.webobjects.eocontrol.EOQualifier,
com.webobjects.foundation.NSArray, boolean) @bci=77,
line=8920 (Interpreted frame)
-
com
.acaiq
.fondation
.acaiqCore
._Membre.licences(com.webobjects.eocontrol.EOQualifier,
boolean) @bci=4, line=8893 (Interpreted frame)
-
com
.acaiq
.fondation
.acaiqCore
.Membre
.licencesParEtats
(com.acaiq.fondation.acaiqCore.EtatMembre[]) @bci=100,
line=980 (Interpreted frame)
- com.acaiq.fondation.acaiqCore.Membre.licencesValides()
@bci=11, line=996 (Interpreted frame)
- com.acaiq.fondation.acaiqCore.Membre.estCourtier()
@bci=5, line=1035 (Interpreted frame)
-
sun
.reflect.GeneratedMethodAccessor87.invoke(java.lang.Object,
java.lang.Object[]) @bci=40 (Interpreted frame)
Which reminds me of an unlocked EC/OSC. Second:
java.lang.IllegalArgumentException: Attribute
noCommandeOracle can't receive a null parameter :
at
com
.acaiq
.fondation
.depot
.lbaArticle
._CommandesEcom.setNoCommandeOracle(_CommandesEcom.java:419)
This is a *template* that throws on null?? You sure that's
such a bright idea? Isn't this what validation is for? And
third:
at
com
.acaiq
.depot
.component
.TransactionAchat.performAction(TransactionAchat.java:63)
at
com
.webobjects
.woextensions
.WOLongResponsePage.run(WOLongResponsePage.java:119)
As you're throwing from inside a normal
com.webobjects.woextensions.WOLongResponsePage, I seriously
hope you're doing your part of try{} finally{} and EC
unlocking.
Cheers, Anjo
Am 31.05.2010 um 20:02 schrieb Pascal Robert:
One of our apps have deadlocked 5 times over 3 days,
strangely enough it started when we moved our Oracle
Database 10gR2 DB to our VMWare ESX 4.0 cluster. e didn't
re-install Oracle, I simply did a P2V (Physical to VM)
conversion, so it's the exact same version of Oracle DB as
before.
What's happenning is that we store some information on our
Oracle database, save it, and we built a copy of some of
the data to a new EO (different entity) in a SQL Server
2005 db so the accounting system take care of billing.
The exception that cause the deadlock (or at least the
last thing written to the log before the deadlock) :
java.lang.IllegalArgumentException: Attribute
noCommandeOracle can't receive a null parameter :
at
com
.acaiq
.fondation
.depot
.lbaArticle
._CommandesEcom.setNoCommandeOracle(_CommandesEcom.java:419)
at
com
.acaiq.fondation.depot.Caissier.copiePourLBA(Caissier.java:
267)
at com.acaiq.fondation.depot.Caissier.paye(Caissier.java:
137)
at
com
.acaiq
.depot
.component
.TransactionAchat.performAction(TransactionAchat.java:63)
at
com
.webobjects
.woextensions
.WOLongResponsePage.run(WOLongResponsePage.java:119)
at java.lang.Thread.run(Thread.java:613)
And it happens here :
commandeEcom.setNoCommandeOracle(((Integer)
_commande.clefsPrimaire()));
_commande.clefsPrimaire() is a method that simply do :
return
ERXEOControlUtilities.primaryKeyObjectForObject(this);
So clefsPrimaire() returns null, even if the data was
stored in the Oracle DB (and it's really there) and a
primary key was generated (EOF did it, it's not a "human
generated" PK).
The whole block :
try {
_commande.setEstPaye(true);
_commande.editingContext().saveChanges(); // This is
when we save the EO in Oracle
} catch (Exception e) {
NSLog.err.appendln(e.getMessage());
remboursementPayflow
((Transaction)pfd.valueForKey("transaction"),_commande,
(String) pfd.valueForKey("PNREF"));
pfd = new NSDictionary<Object, String>(CODE_RESUTAT,
"REMBOURSEMENT");
} finally {
try {
copiePourLBA(_commande); // This is the method where we
copy some data to SQL Server
} catch (Exception e) {
NSLog.err.appendln(e.getMessage());
}
_commande.editingContext().saveChanges();
}
This problem didn't happen in the past (but we also had
less requests coming him) and it doesn't always happen :-/
jstack give me tons of this :
Thread t@71683: (state = BLOCKED)
- java.net.PlainSocketImpl.accept(java.net.SocketImpl)
@bci=0, line=382 (Interpreted frame)
- java.net.ServerSocket.implAccept(java.net.Socket)
@bci=50, line=450 (Interpreted frame)
- java.net.ServerSocket.accept() @bci=48, line=421
(Interpreted frame)
- com.webobjects.appserver._private.WOWorkerThread.run()
@bci=26, line=238 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=613 (Interpreted
frame)
The only non-blocked thread is :
Thread t@78083: (state = IN_NATIVE)
-
java.net.PlainSocketImpl.socketAccept(java.net.SocketImpl)
@bci=0 (Interpreted frame)
- java.net.PlainSocketImpl.accept(java.net.SocketImpl)
@bci=7, line=384 (Interpreted frame)
- java.net.ServerSocket.implAccept(java.net.Socket)
@bci=50, line=450 (Interpreted frame)
- java.net.ServerSocket.accept() @bci=48, line=421
(Interpreted frame)
- com.webobjects.appserver._private.WOWorkerThread.run()
@bci=26, line=238 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=613 (Interpreted
frame)
Beside going back to our Oracle physical server, I have no
idea of why I'm getting this. Java 1.5 on OS X 10.4.11
Server, WO 5.3.3. W
<pid23991.txt>
----
Pascal Robert
email@hidden
AIM: MacTICanada
Twitter : MacTICanada
LinkedIn : http://www.linkedin.com/in/macti
WO Community profile : http://wocommunity.org/page/member?name=probert
--
Chuck Hill Senior Consultant / VP Development
Practical WebObjects - for developers who want to increase their
overall knowledge of WebObjects or who are trying to solve
specific problems.
http://www.global-village.net/products/practical_webobjects
--
Chuck Hill Senior Consultant / VP Development
Practical WebObjects - for developers who want to increase their
overall knowledge of WebObjects or who are trying to solve specific
problems.
http://www.global-village.net/products/practical_webobjects
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden