Re: hangs in flockfile() during fread() or fclose()

10 Jul 2014

On Wed, 2014-07-09 at 21:46 -0700, Greg Parker wrote:
...
On Jul 9, 2014, at 9:26 PM, Paul Smith <paul@mad-scientist.net> wrote:
...
On Wed, 2014-07-09 at 19:55 -0500, Stephen J. Butler wrote:
...
Can you distill this down to a self-contained test case?
I'm wondering if someone has pointers on what we might investigate (and
how) when we get a process in this state, that might help us narrow down
where to look or what to concentrate on.
Some possibilities include:
* That thread is deadlocked against itself because it's trying to call
fread() from a signal handler and the signal handler interrupted
another flockfile-ing call. What is the rest of that stack trace?
Nope, we don't do signal handlers.  Signals are either ignored or cause
the process to crash; we don't register any handlers.  The rest of the
stack trace is unremarkable; a bunch of our internal functions and at the
top (bottom?) of the stack:

#17 0x00007fff88cd4899 in _pthread_body ()
#18 0x00007fff88cd472a in _pthread_start ()
#19 0x00007fff88cd8fc9 in thread_start ()
...
* The process is deadlocked because some other thread owns the lock
and won't let go for some reason. What are the other threads' stack
traces?
They're all waiting in recv() or sleep().  But, unless I'm badly
misunderstanding something, this cannot be true because the FILE* is
opened, used, and closed within this function and the lock is local to
the FILE object and never shared between different FILE objects.

The one thing I was thinking is this: maybe I have an fopen() then an
fclose(), but then something uses the FILE* again after the fclose()
which causes the stdio structure to be corrupted somehow.  Then that
FILE* object is used for another fopen() and this causes the problem.

I can't find anyplace that this happens, but I will look harder.
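
A cheap guard against the use-after-fclose() scenario described above is
to clear the pointer at the moment it is closed, so a later accidental
use goes through NULL and fails at once instead of touching a recycled
FILE.  A minimal sketch; close_and_clear() is a hypothetical helper, not
something from the code under discussion:

```c
#include <stdio.h>

/* Closes *fpp if it is non-NULL, then sets it to NULL.
 * Returns fclose()'s result, or 0 if there was nothing to close,
 * which also turns an accidental double close into a no-op. */
int close_and_clear(FILE **fpp)
{
    int rc = 0;

    if (fpp != NULL && *fpp != NULL) {
        rc = fclose(*fpp);
        *fpp = NULL;
    }
    return rc;
}
```

This only protects the one variable passed in; any other copy of the
FILE* still dangles after the close, so it narrows the search rather
than ruling the theory out.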
...
* The lock is broken because a memory error smashed it. What do the
memory contents of the lock look like? (I don't know what the
internals of the current pthread mutex look like, but the first four
bytes should be something similar to 'MUTX'.)
OK, memory errors are always a possibility.  I'll see if I can dig into
the lock contents.
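
For looking at the lock contents, a small formatter that renders raw
bytes as hex plus printable ASCII makes a signature such as 'MUTX' easy
to spot.  This is a sketch under assumptions: format_bytes() is a
hypothetical helper, and finding the address of the lock inside a FILE
is implementation-specific and not shown here (a debugger can supply
it).

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Formats `count` bytes starting at `addr` as "xx xx ... |ascii|".
 * Non-printable bytes appear as '.' in the ASCII part.
 * Returns 0 on success, -1 if `out` is NULL or too small. */
int format_bytes(const void *addr, size_t count, char *out, size_t out_size)
{
    const unsigned char *p = addr;
    /* "xx " per byte, one ASCII char per byte, two bars, one NUL. */
    size_t needed = count * 3 + count + 3;
    size_t pos = 0;

    if (out == NULL || out_size < needed)
        return -1;

    for (size_t i = 0; i < count; i++)
        pos += (size_t)snprintf(out + pos, out_size - pos, "%02x ", p[i]);

    out[pos++] = '|';
    for (size_t i = 0; i < count; i++)
        out[pos++] = isprint(p[i]) ? (char)p[i] : '.';
    out[pos++] = '|';
    out[pos] = '\0';
    return 0;
}
```

Dumping a freshly initialized pthread_mutex_t the same way gives a
known-good pattern to compare against.  If the signature is stored as an
integer, a little-endian machine will show its bytes in reverse order in
a raw dump.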

On Thu, 10 Jul 2014 at 00:04 -0500, Stephen J. Butler wrote:
...
Incredible claims -- like stdio being broken -- require incredible
evidence.
I definitely agree, that's why I didn't say stdio was broken :-) (or at
least, I didn't mean to do so).  I'm just reporting what I see and
asking for tips on where to go from here.
...
From our perspective, it's much more likely that your code has a
memory corruption, double free, or stack smashing bug somewhere than
stdio not working correctly. You might want to try Malloc Debug if you
haven't already:
https://developer.apple.com/library/mac/documentation/performance/conceptual...
...
valgrind, although a pain to set up, has also helped me find
memory-related bugs in the past.
We do run valgrind on GNU/Linux, but not on MacOS.  99% of the code is
identical between them; the few differences are unrelated to file IO.
In fact, portability is the reason we're using stdio instead of
open/read/close.  I've considered avoiding this problem by switching to
system calls, as stdio gives us no real benefit the way we use it except
for portability.
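
For reference, the open/read/close alternative to fopen/fread/fclose
might look roughly like the following.  It is a sketch only: read_file()
is a hypothetical name, and the fixed-size caller buffer is an
assumption about how the data is consumed.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Reads up to `cap` bytes from `path` into `buf`.
 * Returns the number of bytes read, or -1 on error with errno set. */
ssize_t read_file(const char *path, void *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n > 0) {
            total += (size_t)n;     /* short reads are normal; keep going */
        } else if (n == 0) {
            break;                  /* end of file */
        } else if (errno != EINTR) {
            int saved = errno;      /* close() may overwrite errno */
            close(fd);
            errno = saved;
            return -1;
        }
        /* n < 0 with EINTR: loop around and retry the read */
    }

    close(fd);
    return (ssize_t)total;
}
```

There is no per-stream lock on this path, so it sidesteps flockfile()
entirely, at the cost of handling short reads and EINTR by hand.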

The problem of course is that even with an optimized build the full
tests take 6+ hours and the failure doesn't always happen, and running
that under valgrind or even a debugging malloc will take significantly
longer (and could potentially change the timing so it never happens).

However, I guess it's up to me now.  I'll follow up if I figure out
anything.

Cheers!

Paul Smith