On Jul 10, 2014, at 7:30 AM, Paul Smith <paul@mad-scientist.net> wrote:
On Thu, 2014-07-10 at 08:53 -0400, Paul Smith wrote:
However, I guess it's up to me now. I'll follow up if I figure out anything.
Hm. OK, I think I have a smoking gun.
I looked at the core again and it turns out there's ANOTHER thread, which is ALSO hung in flockfile(). I didn't notice it the first time. So, a real deadlock situation rather than a memory stomp.
Examining the other thread, I see a bug in our code: when we write log output, we check whether the log FILE* is NULL and, if so, write to stdout instead. We do this properly everywhere except for the fflush() operation at the end; here we just pass the log FILE* without checking. And of course, if you call fflush(NULL) then it will try to flush all open output streams, and this is where it hangs:
Thread 7 (core thread 6):
#0  0x00007fff8a997746 in __psynch_mutexwait ()
#1  0x00007fff88cd7779 in _pthread_mutex_lock ()
#2  0x00007fff856c0edd in flockfile ()
#3  0x00007fff856c156f in sflush_locked ()
#4  0x00007fff856c3e82 in _fwalk ()
#5  0x0000000101d9112f in Engine::logger (this=<unavailable>, message=<unavailable>) at /Users/build/src/Logger.cpp:887
Looking at the code in libc I _think_ I see the problem:
* In fopen() if we need to get new FILE objects it appears to me that they are added into the list _before_ they are completely initialized; in particular before the INITEXTRA() macro, which is what initializes the _fl_lock mutex, is run.

* In _fwalk() we walk the list of open FILE objects without taking a lock; the comment says:

     * It should be safe to walk the list without locking it;
     * new nodes are only added to the end and none are ever
     * removed.
Then _fwalk() passes the FILE object to sflush_locked() which calls flockfile() on it.
So I believe that it's possible for one thread to be calling flockfile() on a FILE object with an uninitialized _fl_lock mutex, leading to corruption.
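The ordering problem described above can be illustrated with a small sketch. This is not the libc code: struct node, list_append() and list_walk() are hypothetical names, and the sketch assumes that appends are serialized by the caller while walkers take no list lock. The idea is only that a node's mutex is fully initialized before the node becomes reachable, and that the publishing store and the walker's load are paired as release/acquire (C11 atomics).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; stands in for a FILE with its own lock. */
struct node {
    pthread_mutex_t lock;          /* per-node lock, like _fl_lock */
    int ready;                     /* stands in for the rest of the state */
    _Atomic(struct node *) next;   /* link read by lock-free walkers */
};

/* Statically initialized head; its next pointer starts out NULL. */
static struct node list_head = { .lock = PTHREAD_MUTEX_INITIALIZER, .ready = 1 };

/*
 * Append n after tail. Assumption: callers serialize appends themselves.
 * The node is completely initialized BEFORE it is linked into the list.
 */
static int list_append(struct node *tail, struct node *n)
{
    if (pthread_mutex_init(&n->lock, NULL) != 0)
        return -1;
    n->ready = 1;
    atomic_init(&n->next, (struct node *)NULL);

    /* Release store: the initialization above is visible to any thread
     * that observes n through tail->next with an acquire load. */
    atomic_store_explicit(&tail->next, n, memory_order_release);
    return 0;
}

/*
 * Walk the list without a list-wide lock, locking each node in turn.
 * Returns the number of ready nodes visited.
 */
static int list_walk(struct node *head)
{
    int visited = 0;
    for (struct node *n = head; n != NULL;
         n = atomic_load_explicit(&n->next, memory_order_acquire)) {
        pthread_mutex_lock(&n->lock);   /* safe: lock was initialized before publish */
        visited += n->ready;
        pthread_mutex_unlock(&n->lock);
    }
    return visited;
}
```

If the link were stored before pthread_mutex_init() ran, or with no ordering between the two, a walker could reach the node and lock a mutex that is not yet initialized, which is the situation suspected above.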
I will change our code, since we definitely don't want to be calling fflush(NULL), but it seems also to be a problem in libc since this kind of thing is exactly why flockfile() is there, IIUC.
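A minimal sketch of the kind of change meant here, assuming a logging routine shaped roughly like the one described; log_stream() and log_message() are hypothetical names, not the actual code in Logger.cpp. The point is that the NULL fallback is resolved once, so the write and the flush always use the same non-NULL stream and fflush(NULL) is never reached by accident.

```c
#include <stdio.h>

/* Hypothetical helper: choose the log stream, falling back to stdout. */
static FILE *log_stream(FILE *log_fp)
{
    return log_fp != NULL ? log_fp : stdout;
}

/*
 * Hypothetical logging routine. Returns 0 on success, -1 on error.
 * The stream is resolved once, so fflush() never receives NULL.
 */
static int log_message(FILE *log_fp, const char *message)
{
    FILE *out = log_stream(log_fp);

    if (fputs(message, out) == EOF || fputc('\n', out) == EOF)
        return -1;

    /* Flush only the stream just written, not every open stream. */
    return fflush(out) == EOF ? -1 : 0;
}
```

Calling log_message(NULL, "text") writes to and flushes stdout only; no other FILE object is touched.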
Am I understanding the code correctly, or did I miss something? Should I file a bug? If so where's the best place?
Your diagnosis sounds plausible to me. That code needs appropriate memory barriers if it wants to play games with lock-free algorithms, and I don't see any.

Please file a bug report from http://bugreport.apple.com.

--
Greg Parker     gparker@apple.com     Runtime Wrangler