Running “sudo jstack –F <process id>” should dump a trace of all threads. Should…
Luckily, it did; just it is sort of weird that it identifies threads by name in the deadlock report, but by some unknown ID (perhaps hash of the Thread instance, looks like
that) in the traces, which makes for sort of difficult parsing. I guess I'm not the first one to swear :)
... As for the hang, the code in the logging does seem like a likely culprit.
Actually it looks like the culprit is log4j itself?!? What the heck! Anyroad, here's the deadlock report:
"model.ReportPDFTask@5f6ce9a5":
waiting to lock Monitor@0x000000012c0008a8 (Object@0x00000003f7efb980, a org/apache/log4j/spi/RootLogger),
which is held by "WorkerThread6"
waiting for ownable synchronizer 0x00000003f9538960, (a java/util/concurrent/locks/ReentrantLock$NonfairSync),
which is held by "model.ReportPDFTask@5f6ce9a5"
the WorkerThread6 problem happens in the “log.info("$tch")” part, it looks like this (and although changesFromCommittedSnapshot happens to be part of the culprit, I believe
the problem is actually caused by the logger itself): instead of rendering its argument immediately, as soon as it is determined that we should log, it does that self-evidently under a lock:
=== Worker6 (cleaned up considerably) ===
- er.extensions.eof.ERXEC.lockObjectStore() // the lock W6 [2] hangs on, caused by...
- er.extensions.eof.ERXGenericRecord.changesFromCommittedSnapshot() // ... changesFromCommittedSnapshot ...
- cz.ocs.model.OCSEnterpriseObject.toString() // ... which yours truly dumbly uses in toString, alas!
- org.apache.log4j.or.DefaultRenderer.doRender(java.lang.Object) // ... this thing happens INSIDE of log4j code
... lot of frames here; self-evidently, some of them acquires and holds Monitor@0x000000012c0008a8 [1] ...
- org.apache.log4j.Category.info(java.lang.Object) // this is the log.info("$tch") of mine
... the loop and the other standard stuff up to dispatchRequest and WOWorkerThread.run() ...
The "model.ReportPDFTask" actually does not do anything wrong (far as I can say) -- it just fetches. A fetch presumably (and understandably) would lock the object store. With
locked object store, it... logs! (I do not think the stack trace is important here; anyway, I have attached it at the end of this message for reference.)
Nothing wrong there I believe, it feels right to be able to log with a locked object store; but it causes deadlock, for the log tries to acquire its Monitor@0x000000012c0008a8
-- which is held by Worker6.
Now, although in my case the culprit happens to be unneeded (and generally dangerous) changesFromCommittedSnapshot, I believe that
(a) whatever which locks object store in the log would cause deadlock just as well
(b) which can be e.g., a fault fired, or lots of other perfectly valid things to log.
I do understand why log4j methods accept generic Object and render the result inside -- it is reasonable not to render anything unless we want to log with the current setting.
What seems to me to be a grave bug though is that this rendering happens under a lock, which causes a deadlock if two threads log concurrently and it so happens that
(i) one of them logs under object store lock (which I believe is valid and correct)
(ii) the other's log contents causes an object store lock (which again I believe is valid and correct, see (a) and (b) above).
So, well, is there indeed a grave bug inside of the ubiquitous log4j, or am I overlooking something of importance?
And if there is a bug, is there anything better one can do to work around it than using toStrings for all non-trivial logs, like “log.info("non-trivial-contents".toString())”,
which would affect efficiency pretty bad?
=== "model.ReportPDFTask@5f6ce9a5" slightly cleaned up ===
- org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) @bci=12, line=204 (Compiled frame) // hangs on Monitor@0x000000012c0008a8, which is held by [1]
above
- org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) @bci=14, line=391 (Compiled frame)
- org.apache.log4j.Category.info(java.lang.Object) @bci=38, line=666 (Compiled frame)
- er.extensions.eof.ERXEOAccessUtilities.logExpression(com.webobjects.eoaccess.EOAdaptorChannel, com.webobjects.eoaccess.EOSQLExpression, long) @bci=406, line=1307 (Compiled
frame)
- er.extensions.eof.ERXAdaptorChannelDelegate.adaptorChannelDidEvaluateExpression(com.webobjects.eoaccess.EOAdaptorChannel, com.webobjects.eoaccess.EOSQLExpression) @bci=76,
line=81 (Compiled frame)
- com.webobjects.foundation._NSDelegate.perform(java.lang.String, java.lang.Object, java.lang.Object) @bci=14, line=163 (Compiled frame)
- er.extensions.jdbc.ERXJDBCAdaptor$Channel.evaluateExpression(com.webobjects.eoaccess.EOSQLExpression) @bci=2, line=286 (Compiled frame)
- com.webobjects.eocontrol.EOObjectStoreCoordinator.objectsWithFetchSpecification(com.webobjects.eocontrol.EOFetchSpecification, com.webobjects.eocontrol.EOEditingContext)
@bci=97, line=488 (Compiled frame) // somewhere in this code probably object store is locked [2], which sort of makes sense
- com.webobjects.eocontrol.EOEditingContext.objectsWithFetchSpecification(com.webobjects.eocontrol.EOFetchSpecification, com.webobjects.eocontrol.EOEditingContext) @bci=79,
line=4069 (Interpreted frame)
- er.extensions.eof.ERXEC.objectsWithFetchSpecification(com.webobjects.eocontrol.EOFetchSpecification, com.webobjects.eocontrol.EOEditingContext) @bci=10, line=1308 (Interpreted
frame)
- model._DBReport.fetchDBReports(com.webobjects.eocontrol.EOEditingContext, com.webobjects.eocontrol.EOQualifier, com.webobjects.foundation.NSArray) @bci=103, line=141 (Interpreted
frame) // my code which fetches
... the rest, I believe, is irrelevant ...
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Help/Unsubscribe/Update your Subscription: