Re: hangs in flockfile() during fread() or fclose()
- Subject: Re: hangs in flockfile() during fread() or fclose()
- From: Paul Smith <email@hidden>
- Date: Thu, 10 Jul 2014 10:30:49 -0400
On Thu, 2014-07-10 at 08:53 -0400, Paul Smith wrote:
> However, I guess it's up to me now. I'll follow up if I figure out
> anything.
Hm. OK, I think I have a smoking gun.
I looked at the core again and it turns out there's ANOTHER thread,
which is ALSO hung in flockfile(). I didn't notice it the first time.
So, a real deadlock situation rather than memory stomp.
Examining the other thread, I see a bug in our code: we are writing
logging and we check to see if the log FILE* is NULL and if so we write
to stdout instead. We do this properly everywhere except for the
fflush() operation at the end; here we just pass the log FILE* without
checking. And of course, if you call fflush(NULL) then it will try to
flush all open output streams, and this is where it hangs:
Thread 7 (core thread 6):
#0 0x00007fff8a997746 in __psynch_mutexwait ()
#1 0x00007fff88cd7779 in _pthread_mutex_lock ()
#2 0x00007fff856c0edd in flockfile ()
#3 0x00007fff856c156f in sflush_locked ()
#4 0x00007fff856c3e82 in _fwalk ()
#5 0x0000000101d9112f in Engine::logger (this=<unavailable>, message=<unavailable>) at /Users/build/src/Logger.cpp:887
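A minimal sketch of the logging fix described above. The real Engine::logger code is not shown in this thread, so log_target() and log_line() are hypothetical names used only for illustration. The point is to resolve the NULL fallback once and use the resolved stream for every call, including fflush():

```c
#include <stdio.h>

/* Hypothetical helper: fall back to stdout when no log file is open. */
FILE *log_target(FILE *log_fp)
{
    return log_fp != NULL ? log_fp : stdout;
}

/* Hypothetical logger. Returns 0 on success and EOF on error, like fflush(). */
int log_line(FILE *log_fp, const char *message)
{
    FILE *out = log_target(log_fp);

    if (fputs(message, out) == EOF || fputc('\n', out) == EOF)
        return EOF;

    /* Flush the resolved stream, never the raw pointer: fflush(NULL)
       asks libc to flush every open output stream in the process. */
    return fflush(out);
}
```

With this shape there is no code path that can hand a NULL pointer to fflush(), so the logger only ever touches its own stream.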
Looking at the code in libc I _think_ I see the problem:
  * In fopen() if we need to get new FILE objects it appears to me
    that they are added into the list _before_ they are completely
    initialized; in particular before the INITEXTRA() macro, which
    is what initializes the _fl_lock mutex, is run.
  * In _fwalk() we walk the list of open FILE objects without taking
    a lock; the comment says:
        * It should be safe to walk the list without locking it;
        * new nodes are only added to the end and none are ever
        * removed.
    Then _fwalk() passes the FILE object to sflush_locked() which calls
    flockfile() on it.
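A rough sketch of the ordering that would make an unlocked walk safe. This is not the libc source: struct node, node_add() and walk_and_flush() are made-up stand-ins for FILE, the fopen() allocation path and _fwalk(). For brevity it pushes new nodes at the head with C11 atomics rather than appending at the tail. The only point is that the per-node mutex is initialized before the node becomes reachable from the list:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a FILE object and its per-stream lock. */
struct node {
    pthread_mutex_t lock;
    int flushed;                      /* how many times the walker visited */
    _Atomic(struct node *) next;
};

static _Atomic(struct node *) list_head;

/* Fully initialize the node first, then publish it to the list. */
struct node *node_add(void)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;

    pthread_mutex_init(&n->lock, NULL);   /* lock is valid before publish */
    n->flushed = 0;
    atomic_init(&n->next, NULL);

    struct node *old = atomic_load(&list_head);
    do {
        atomic_store(&n->next, old);
    } while (!atomic_compare_exchange_weak(&list_head, &old, n));
    return n;
}

/* Walk the list without a list-wide lock, locking each node in turn. */
int walk_and_flush(void)
{
    int visited = 0;
    for (struct node *n = atomic_load(&list_head); n != NULL;
         n = atomic_load(&n->next)) {
        pthread_mutex_lock(&n->lock);
        n->flushed++;
        pthread_mutex_unlock(&n->lock);
        visited++;
    }
    return visited;
}
```

If the pthread_mutex_init() call were moved after the compare-and-swap, a concurrent walker could lock a mutex that does not exist yet, which is the window suspected above.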
So I believe that it's possible for one thread to be calling flockfile()
on a FILE object whose _fl_lock mutex is not yet initialized, leading to
corruption or, as in our core, a hang.
I will change our code, since we definitely don't want to be calling
fflush(NULL), but it seems also to be a problem in libc since this kind
of thing is exactly why flockfile() is there, IIUC.
Am I understanding the code correctly, or did I miss something? Should
I file a bug? If so where's the best place?
Cheers!
_______________________________________________
Darwin-dev mailing list (email@hidden)