
This turned out to probably be user error (or at least bad design). Some code set the SIGCHLD handler to ‘IGNORE’; perl would gracefully warn us that it couldn’t do that and change the behavior to “default”, and then we’d get into this situation. We ultimately swapped in a SIGCHLD handler that would reap the zombie pid, and the problem went away.
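
For anyone hitting the same thing, a reaping handler along these lines is roughly what is meant. This is only a minimal sketch, not the code from our server: the %reaped hash and the reap_children name are made up for illustration, and the fork at the bottom is just a demo child.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical bookkeeping for illustration: pid => raw wait status.
my %reaped;

# Collect every child that has already exited, without blocking.
sub reap_children {
    # Keep the handler from clobbering $! and $? in the interrupted code.
    local ($!, $?);
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped{$pid} = $?;
    }
}

$SIG{CHLD} = \&reap_children;

# Demo: fork a child that exits immediately with status 3.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
POSIX::_exit(3) if $pid == 0;

# Give the handler up to about five seconds to record the child.
my $tries = 0;
until (exists $reaped{$pid} || $tries++ >= 50) {
    select(undef, undef, undef, 0.1);
}

if (exists $reaped{$pid}) {
    printf "reaped pid %d, exit status %d\n", $pid, $reaped{$pid} >> 8;
}
```

The loop around waitpid matters because several children can exit before the handler runs once, and WNOHANG keeps the handler from blocking when nothing is left to collect.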

So I’m not sure whether it would have been correct for the kernel call to behave this way when an app isn’t reaping dead child pids; perhaps someone else can comment on that. Fortunately, it’s a moot question for us.

From: Dave Zarzycki [mailto:email@hidden]
Sent: Thursday, January 24, 2008 9:03 AM
To: Nathan Herring
Cc: email@hidden
Subject: Re: perl 5.8.8, backtick execution and leopard

This sounds like a kernel bug. We'd need to attach a kernel debugger to investigate further. Also, what is the third argument to wait4()? If the WNOHANG flag is being passed, then this is definitely a kernel bug. One more thing: if you're running dtrace, then please probe the following kernel APIs: proc_reparent() and ptrace(), and let us know if either call is ever being made on your box (you'll need to avoid using gdb during this test).

Thanks!

davez

On Jan 23, 2008, at 3:42 PM, Nathan Herring wrote:

We have a lightweight perl-based HTTP server running. The main loop calls accept() (with a 10s timeout), and if it times out, we run an “OnTick” periodic task. After upgrading from 10.4.10 to 10.5.1, the server will eventually stop responding. We’ve tracked it down to perl hanging in __wait4() underneath a backtick execution (calling df to determine whether we need to run disk space cleanup commands). wait4()’s argument is a pid that isn’t in the process list, but nonetheless, the function never returns. This doesn’t seem to happen until the HTTP server kicks off a local process that uses the machine heavily (including making major edits in the directory on which df is called). Because the pid isn’t around, I don’t think df is the culprit, but I cannot fathom why wait4 would get stuck. (From gdb’s perspective, there’s only one thread, so...)

Is this a known issue, and/or is there something I can do to track it down further or work around it? (I was using Instruments to trace the user function Perl_my_popen with argument1 (the cmd), and the wait4 syscall on both entry and exit, and saw that the exit isn’t happening.)

-nh