Re: hangs in flockfile() during fread() or fclose()
On Wed, 2014-07-09 at 21:46 -0700, Greg Parker wrote:
On Jul 9, 2014, at 9:26 PM, Paul Smith <paul@mad-scientist.net> wrote:
On Wed, 2014-07-09 at 19:55 -0500, Stephen J. Butler wrote:
Can you distill this down to self contained test case?
I'm wondering if someone has pointers on what we might investigate (and how) when we get a process in this state, that might help us narrow down where to look or what to concentrate on.
Some possibilities include:
* That thread is deadlocked against itself because it's trying to call fread() from a signal handler and the signal handler interrupted another flockfile-ing call. What is the rest of that stack trace?
Nope, we don't do signal handlers. Signals are either ignored or cause the process to crash; we don't register any handlers. The rest of the stacktrace is unremarkable; a bunch of our internal functions and at the top (bottom?) of the stack:
#17 0x00007fff88cd4899 in _pthread_body ()
#18 0x00007fff88cd472a in _pthread_start ()
#19 0x00007fff88cd8fc9 in thread_start ()
* The process is deadlocked because some other thread owns the lock and won't let go for some reason. What are the other threads' stack traces?
They're all waiting in recv() or sleep(). But, unless I'm badly misunderstanding something this cannot be true because the FILE* is opened, used, and closed within this function and the lock is local to the FILE object and never shared between different FILE objects. The one thing I was thinking is this: maybe I have an fopen() then an fclose(), but then something uses the FILE* again after the fclose() which causes the stdio structure to be corrupted somehow. Then that FILE* object is used for another fopen() and this causes the problem. I can't find anyplace that this happens, but I will look harder.
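One way to test the use-after-fclose() theory is sketched below. This is only an illustration, not code from the application under discussion: `close_and_clear` is a hypothetical helper name, and it assumes every call site can hand over the address of its FILE* variable. Clearing the pointer at close time means a stale use through that variable passes NULL to stdio, which tends to crash at the offending call rather than quietly corrupting a recycled FILE object.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the original code): close the stream and
 * clear the caller's pointer so a later use through the same variable
 * is handed NULL instead of a dangling FILE*. */
static int close_and_clear(FILE **fpp)
{
    assert(fpp != NULL);
    if (*fpp == NULL)
        return 0;             /* already closed: treat as a no-op */
    int rc = fclose(*fpp);    /* the stream is invalid after this, even on error */
    *fpp = NULL;
    return rc;
}
```

This only catches reuse through the variable that was cleared; a copy of the pointer held elsewhere would still dangle, so it narrows the search rather than ruling the theory out.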
* The lock is broken because a memory error smashed it. What do the memory contents of the lock look like? (I don't know what the internals of the current pthread mutex look like, but the first four bytes should be something similar to 'MUTX'.)
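As a rough sketch of what "looking at the memory contents" could mean in code, the helpers below hex-dump the leading bytes of a pthread_mutex_t. The names `hex_prefix` and `dump_mutex_prefix` are made up for this illustration, and where the lock lives inside a FILE is a libc internal that is not assumed here; on a hung process the same bytes would more likely be read from the debugger. If the signature is stored as a 32-bit integer, the characters may appear in reverse order on a little-endian machine.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Format the first n bytes at p as lowercase hex into out (capacity cap,
 * including the terminating NUL). Returns the number of bytes formatted. */
static size_t hex_prefix(const void *p, size_t n, char *out, size_t cap)
{
    assert(p != NULL && out != NULL);
    const unsigned char *b = p;
    size_t i, used = 0;
    if (cap == 0)
        return 0;
    out[0] = '\0';
    /* Each byte needs two hex digits plus room for the NUL. */
    for (i = 0; i < n && used + 3 <= cap; i++)
        used += (size_t)snprintf(out + used, cap - used, "%02x", b[i]);
    return i;
}

/* Print the first few bytes of a mutex to stderr for eyeballing. */
static void dump_mutex_prefix(const pthread_mutex_t *m)
{
    char hex[2 * 8 + 1];
    size_t n = sizeof *m < 8 ? sizeof *m : 8;
    hex_prefix(m, n, hex, sizeof hex);
    fprintf(stderr, "mutex %p: first %zu bytes = %s\n",
            (const void *)m, n, hex);
}
```

For comparison, the ASCII bytes of "MUTX" are 4d 55 54 58, so a dump of an intact lock would be expected to contain those four values in one order or the other if the hint above holds for the libc in use.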
OK, memory errors are always a possibility. I'll see if I can dig into the lock contents.
On Thu, 10 Jul 2014 at 00:04 -0500, Stephen J. Butler wrote:
Incredible claims -- like stdio being broken -- require incredible evidence.
I definitely agree, that's why I didn't say stdio was broken :-) (or at least, I didn't mean to do so). I'm just reporting what I see and asking for tips on where to go from here.
From our perspective, it's much more likely that your code has a memory corruption, double free, or stack smashing bug somewhere than stdio not working correctly. You might want to try Malloc Debug if you haven't already:
https://developer.apple.com/library/mac/documentation/performance/conceptual...
valgrind, although a pain to set up, has also helped me find memory-related bugs in the past.
We do run valgrind on GNU/Linux, but not MacOS. 99% of the code is identical between them, but there are a few differences (but none related to file IO; in fact that's why we're using stdio instead of open/read/close: for portability--I've considered avoiding this problem by switching to system calls as there's no real benefit to stdio the way we use it except for portability).
The problem of course is that even with an optimized build the full tests take 6+ hours and the failure doesn't always happen, and running that under valgrind or even a debugging malloc will take significantly longer (and could potentially change the timing so it never happens).
However, I guess it's up to me now. I'll follow up if I figure out anything.
Cheers!
Darwin-dev mailing list (Darwin-dev@lists.apple.com)
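For reference, the switch from stdio to plain system calls that Paul mentions considering might look roughly like the sketch below. It is an assumption-laden illustration rather than the application's code: `read_file_fd` is a hypothetical name, and it simply reads up to a fixed number of bytes with open/read/close, retrying on EINTR and looping over short reads. No FILE object or stdio lock is involved on this path.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read up to cap bytes from path into buf using
 * open/read/close. Returns the number of bytes read, or -1 on error
 * with errno set by the failing call. */
static ssize_t read_file_fd(const char *path, void *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data: retry */
            int saved = errno;     /* keep read()'s errno across close() */
            close(fd);
            errno = saved;
            return -1;
        }
        if (n == 0)
            break;                 /* end of file */
        total += (size_t)n;        /* short read: keep going */
    }
    close(fd);
    return (ssize_t)total;
}
```

If the hang is caused by memory corruption elsewhere, a change like this would only move the symptom, so it is better seen as a way to take stdio out of the picture than as a fix.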