On Oct 14, 2005, at 2:11 PM, Merle McClelland wrote:
On MacOS 10.4.2, a Startup Items daemon process that I am writing
for a client occasionally shows up in Activity Monitor tagged with
"not responding". I believe this also occurred on MacOS 10.3.9.
The details for the process show one "recent hang". However, while
Activity Monitor shows "not responding", the daemon is executing
as expected, and is not hung. Exiting Activity Monitor and
restarting it removes the "not responding" tag, so it looks like
Activity Monitor saw what it considered a hang at some point and
even though the process responded after that point, Activity
Monitor did not clear the "not responding" status.
Looking at the process using top and ps I see nothing out of the
ordinary. Client applications that talk to this daemon via TCP/IP
continue to function and report no problems, and the daemon
itself, which is monitoring devices on the network using TCP/IP,
continues to monitor them as expected. CPU usage is low (2 to 3%),
there don't appear to be any memory leaks, and the daemon
process responds to SIGTERM and SIGHUP signals which it uses for
graceful shutdown and restart.
The daemon is using some of my client's cross-platform libraries
that are based on ZThread-2.3.2, with a Carbon-based wrapper. The
wrapper invokes the low-level libraries, and also monitors system
power states using the IOManager and handles signals via standard
signal handlers. Depending on the number of network devices being
monitored by the daemon there can be 10 to 15 threads executing in
the process. The process also uses syslog every two seconds or so
for debugging messages.
Cocoa-based GUI client applications that talk to this daemon are
also using these cross-platform libraries and multiple ZThreads,
but have never been tagged in Activity Monitor as "not responding".
I've searched the mailing list archives and the Internet in
general for references to Activity Monitor and "not responding"
and "recent hangs", and have also searched man pages for some clue
as to what is going on. I haven't found a definition of what
specific process event or state is reflected by these labels in
Activity Monitor.
My question is: what are Activity Monitor's definitions of "not
responding" and "recent hangs"?
"not responding" means that the application is currently "hung".
"recent hangs" is the count of times the application was not
responding, but then became responsive again.
Activity Monitor sends "ping" AppleEvents to processes to determine
if they are hung. If they respond to the event, then they are not
hung. If they fail to respond to the event, then they are
considered to be in a "Not Responding" state, and each
transition from "Responding" to "Not Responding" that it sees
counts as a "hang" of the application.
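To make the two labels concrete, here is a minimal sketch in C of the
bookkeeping described above. It is only a model of the description, not
Activity Monitor's actual code; the names hang_stats and tally_pings are
made up for illustration, and each array entry stands for whether one
ping was answered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the current state plus the number of
 * Responding -> Not Responding transitions seen so far. */
struct hang_stats {
    bool not_responding;   /* state after the most recent ping */
    unsigned hangs;        /* transitions into "Not Responding" */
};

/* answered[i] is true if the i-th ping got a reply. */
struct hang_stats tally_pings(const bool answered[], size_t n)
{
    struct hang_stats s = { false, 0 };

    assert(answered != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool hung = !answered[i];
        if (hung && !s.not_responding)
            s.hangs++;             /* count only the transition itself */
        s.not_responding = hung;
    }
    return s;
}
```

Two missed pings in a row count as one hang, because only the change of
state is counted.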
Normally, an application will be written in such a way that it will
at least *periodically* check for AppleEvents, even if it tends to
call out from a CFRunLoop event into a long, involved process.
You can get false positives on a "Not Responding" (the same
situation that gets the busy cursor on an application in the GUI,
FWIW) if an application does a huge amount of work in something
called from CFRunLoop.
The normal way to avoid this, if you are actually not hung, is to
either periodically save the state of your work in progress and
restart the work by sending yourself an event, or dispatch the work
to a thread, instead of performing it directly as the result of
some event.
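As a rough illustration of the first approach (save state, return to
the run loop, resume on the next event), here is a sketch in portable
C. The struct and function names are invented for the example, and
summing an array stands in for the real work; the part that re-posts
an event to yourself is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resumable job: all state lives in the struct so the
 * callback can return to the run loop between chunks. */
struct resumable_sum {
    const int *data;
    size_t len;
    size_t next;    /* index of the first unprocessed element */
    long total;     /* saved partial result */
};

/* Process at most max_units elements, then return.
 * Returns true if work remains, in which case the caller should
 * send itself another event and call this again from that event. */
bool resumable_sum_step(struct resumable_sum *s, size_t max_units)
{
    assert(max_units > 0);

    size_t remaining = s->len - s->next;
    size_t n = max_units < remaining ? max_units : remaining;

    for (size_t i = 0; i < n; i++)
        s->total += s->data[s->next++];

    return s->next < s->len;
}
```

Each call does a bounded amount of work, so the run loop gets control
back often enough to answer a ping between chunks.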
If you take the second approach (which is the best), you have to
deal with the fact that you could get "false negatives", i.e. your
application looks like it's still alive when it's actually dead.
To avoid false negatives, you need to provide a barrier interlock
in your CFRunLoop; something like a semaphore or a mutex you take
and drop in the loop will work.
When you dispatch the long duration task to a thread to work on,
*that* thread should immediately take the mutex, effectively
"hanging" the application. At intervals (best handled via an
interval timer or using a work counter), the thread drops and
reacquires the mutex, and when work is complete, it drops the mutex
and goes idle (or exits).
By doing this, you guarantee that if the thread actually becomes
hung, then the CFRunLoop will also become hung because the thread
holds the mutex it needs in order to complete processing the last
event, and process the next one (which might be a ping from Activity
Monitor, or the window server, or some other monitoring process).
That way you only end up with a busy cursor or a Not Responding
status (and an increment of the recent hangs count) if your
application actually becomes hung up.
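The interlock could be sketched roughly as follows with plain pthreads.
All names here (g_interlock, interlock_checkpoint, worker_main, struct
job) are made up for the example. On Mac OS X the checkpoint would
presumably be called once per run-loop pass, for instance from a
CFRunLoop timer or observer callback; that wiring is an assumption and
is not shown.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical interlock shared by the run loop and the worker. */
static pthread_mutex_t g_interlock = PTHREAD_MUTEX_INITIALIZER;

/* Run-loop side: take and drop the mutex once per pass. This blocks
 * for as long as the worker is holding the interlock. */
void interlock_checkpoint(void)
{
    pthread_mutex_lock(&g_interlock);
    pthread_mutex_unlock(&g_interlock);
}

struct job {
    int units;        /* total units of work */
    int yield_every;  /* work counter: release the lock this often */
    int done;         /* units completed so far */
};

/* Worker side: hold the interlock for the whole job, dropping and
 * reacquiring it at intervals so the run loop can service events. */
void *worker_main(void *arg)
{
    struct job *j = arg;

    assert(j->yield_every > 0);
    pthread_mutex_lock(&g_interlock);
    for (int i = 0; i < j->units; i++) {
        /* ... one unit of real work would go here ... */
        j->done++;
        if ((i + 1) % j->yield_every == 0) {
            pthread_mutex_unlock(&g_interlock);
            sched_yield();   /* give the run loop a chance to get in */
            pthread_mutex_lock(&g_interlock);
        }
    }
    pthread_mutex_unlock(&g_interlock);
    return NULL;
}
```

One caveat: pthread mutexes make no fairness guarantee, so a worker
that drops and immediately retakes the lock can still starve the run
loop. The sched_yield() call only makes that less likely; a short sleep
or a semaphore handoff would be a firmer choice if it matters.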
For more complicated applications, where you might have a lot of
work going on in parallel, you can establish a monitor thread and
indicate status to it. The monitor thread would have a common
status area for each outstanding work item in progress (e.g. a
counter in an array of volatile ints as one example), and wake up
periodically (e.g. via use of an interval timer or some other
mechanism). The monitor thread would take the mutex instead of the
work thread. On each wakeup it would compare the current int
counter with its saved copy of the int counter for each
outstanding work item, and if they were the same, that work item is
hung.
At that point, it can decide to take some action, or it can decide
that it's not going to drop the mutex this time around. In
general, you probably only want to decide not to drop the mutex if
you only have one work thread outstanding and it's hung; even then,
it's more likely you'll want to put up a dialog box offering the
user some way to correct (or not) the errant thread that's not
making progress on its work item.
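The counter comparison done by the monitor thread might look something
like this sketch. It swaps the array of volatile ints for C11 atomics,
which is my substitution rather than part of the suggestion above, and
every name in it (progress_table, work_item_tick, monitor_scan,
MAX_ITEMS) is hypothetical. The periodic wakeup and the decision about
whether to drop the mutex are left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITEMS 16

/* Hypothetical common status area: one progress counter per
 * outstanding work item, plus the monitor's private snapshot. */
struct progress_table {
    atomic_int counter[MAX_ITEMS];  /* bumped by worker threads */
    int last_seen[MAX_ITEMS];       /* touched only by the monitor */
};

void progress_table_init(struct progress_table *t)
{
    for (size_t i = 0; i < MAX_ITEMS; i++) {
        atomic_init(&t->counter[i], 0);
        t->last_seen[i] = 0;
    }
}

/* Worker side: call whenever the work item makes progress. */
void work_item_tick(struct progress_table *t, size_t slot)
{
    assert(slot < MAX_ITEMS);
    atomic_fetch_add(&t->counter[slot], 1);
}

/* Monitor side: call on each periodic wakeup. Marks stalled[i] true
 * for every item whose counter has not moved since the previous scan
 * and returns how many items are stalled. */
size_t monitor_scan(struct progress_table *t, size_t n, bool stalled[])
{
    size_t count = 0;

    assert(n <= MAX_ITEMS);
    for (size_t i = 0; i < n; i++) {
        int now = atomic_load(&t->counter[i]);
        stalled[i] = (now == t->last_seen[i]);
        if (stalled[i])
            count++;
        t->last_seen[i] = now;   /* snapshot for the next scan */
    }
    return count;
}
```

A monitor thread following the policy above would keep holding the
mutex only when monitor_scan() reports that every outstanding item is
stalled, and otherwise drop and retake it as usual.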
Hope that helps!
-- Terry
Thanks for the detailed response! That answers my questions. I wasn't
aware of the AppleEvent pinging since my daemon isn't explicitly
using AppleEvents.
Most of the work performed by the daemon is within separate threads
invoked by the main process. Most of the time the main process is in
the CFRunLoop, waiting for an event. However, periodically, the main
process does do some work within the context of the timer and the
IOManager callbacks. Based on your explanation, I suspect that it is
this processing (which sometimes can cause the main process to wait
for the other threads to do their stuff before exiting the callback)
that is causing an occasional lack of response to the AppleEvent ping.
I'll have to evaluate moving these periodic functions to a background
thread so that the main process doesn't get tied up performing the
work. Sounds like I'll have to implement the interlock as you suggest
since in this design the main process won't be doing much of
anything. Given that my process is a daemon without a GUI (and is
designed to run even when users aren't logged in), I can't directly
display error dialogs. The cross-platform code using ZThread has a
monitoring capability to handle hung worker threads. However, it
appears I need to have one additional thread layer between the main
process and the lower-level threads to off-load the main process from
potentially long-term work that may occur periodically.
Is there a good example of writing a MacOS daemon that properly
handles these scenarios? The "MyFirstDaemon" sample code doesn't
cover these issues.
Thanks again,
- Merle