Window-server access bug in parallel programming ... ?
- Subject: Window-server access bug in parallel programming ... ?
- From: Jay Reynolds Freeman <email@hidden>
- Date: Sun, 4 Nov 2007 18:12:35 -0800
I am not sure whether this is the best group for my problem; if not,
please accept my apologies and, er, tell me where to go ...
I have been working on a parallel Lisp implementation (actually,
Scheme, which is a dialect of Lisp), in which separate instances
of the *same* Macintosh application run at the same time, and
share a good deal of memory, both for Scheme object storage
and for interprocess communication. If N copies of my application
are to be launched, the first one is launched in some normal
way, e.g., by mousing on it, and it launches the other N-1
via in essence
system("<path to my application>/Contents/MacOS/<my app's name> &");
(Actually, the call uses argv[0] to get the full path, and adds a few
flags and operands that have to do with identifying which of the
N processes is being started, setting up the mmap, and the like.)
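A minimal sketch of how such a launch command might be assembled from argv[0] follows. The flag names `--instance` and `--total` and both function names are hypothetical; the post does not say which flags the real application uses. The path is single-quoted so that spaces in the bundle path survive the shell, and paths that themselves contain a single quote are rejected rather than escaped.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "'<self_path>' --instance <index> --total <total> &" into buf.
 * The flag names are illustrative assumptions, not the real application's.
 * Returns 0 on success, -1 if buf is too small or the path contains a
 * single quote (which would break the simple quoting used here). */
int build_launch_command(char *buf, size_t buflen,
                         const char *self_path, int index, int total)
{
    assert(buf != NULL && self_path != NULL);
    if (strchr(self_path, '\'') != NULL)
        return -1;
    int n = snprintf(buf, buflen, "'%s' --instance %d --total %d &",
                     self_path, index, total);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Launch instances 1..total-1; instance 0 is the process already running.
 * Returns the number of system() calls that did not report failure. */
int launch_peers(const char *self_path, int total)
{
    char cmd[2048];
    int launched = 0;
    for (int i = 1; i < total; i++) {
        if (build_launch_command(cmd, sizeof cmd, self_path, i, total) != 0)
            continue;
        /* The trailing '&' makes the shell return at once, so system()
         * does not wait for the new instance to exit. */
        if (system(cmd) != -1)
            launched++;
    }
    return launched;
}
```

The first instance would call something like `launch_peers(argv[0], N)` early in startup, assuming argv[0] holds a usable path to the executable.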
To my complete and utter astonishment, this works like gangbusters --
I really do get N complete instances of my application, each with
its own main window, menu bar, and so on. (And by the way, many
thanks to Terry Lambert, a few months ago, for hints on why and how
to use mmap.) After a time chasing down deadlocks, critical-section
violations, and other untoward consequences of parallel processing,
I have gotten things to where I begin to see errors that are not
obviously the consequence of my own obtuseness and inadequate coding
skills.
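For readers unfamiliar with the shared-memory arrangement mentioned above, here is a rough sketch of mapping one file-backed region that several processes can share. It is an illustration under stated assumptions, not the application's actual setup: the function name `map_shared_region` and the choice of a file-backed mapping are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `size` bytes of the file at `path` as shared, read-write memory.
 * Every process that maps the same file with MAP_SHARED sees the same
 * bytes. Returns NULL on failure. (Hypothetical helper for illustration.) */
void *map_shared_region(const char *path, size_t size)
{
    assert(path != NULL && size > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    /* Make sure the backing file is large enough to cover the mapping. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

Each process may receive the region at a different address, so data stored in it is usually linked by offsets from the region base rather than by raw pointers, unless every process arranges to map it at the same fixed address.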
One of them is my subject for tonight. It occurs rarely -- roughly,
once in 10000 runs of my application (I am doing lots of regression
testing) -- actually, that's once in about 2000 runs of 5 parallel
copies of my application at a time; that's about once a day with
my Mac running regression tests nearly full-time. I get a crash with
a crash log, and I won't bother you with too many details.
What I see is a failure in thread zero of my app -- that's the one
where GUI I/O is done, always deep inside a display or displayIfNeeded
of my main window. The actual crash is an EXC_BAD_ACCESS with
code KERN_PROTECTION_FAILURE in a function called szone_free, in
libSystem.B.dylib. This is nested some fifteen or twenty function
calls inside of anything I myself wrote. (And I should say that my
GUI code is written in Cocoa -- my app is model/view/controller,
with the view and controller in Cocoa and the model a separate
thread of straight C++.) (I should also say that I am still working
in Tiger, running Xcode 2.4.1, Mac OS X 10.4.10, on a 2006 model
13-inch MacBook with an Intel Core Duo.)
While contemplating this problem, I remembered something I had read in
Dalrymple and Hillegass's "Advanced Mac OS X Programming": "... Mach
ports are used for a lot of interprocess communications, particularly
to the window server, and are very important to Cocoa." (p. 366)
I then noticed that every time I launched N parallel copies of my
application, I would get N-1 error messages in the console log, each
of the form
"... CFLog (99): CFMessagePortCreateLocal(): failed to name Mach
port ..."
Perusal of archives suggested that each instance of my application is
trying to open a Mach port based on the CFBundleIdentifier in my app's
Info.plist, and only the first of the N is able to do so -- the others
all find the port name they are looking for in use. The archives I
found suggested that the message was harmless, which perhaps blinded
me to its possible significance for a while.
Furthermore, the "first of N" of my parallel processes is always the
one that does the system calls to open the others, and the processes
that have crashed have always been one of the other N-1; that is, one
of the ones that did not get the Mach port it wanted.
So I am wondering if failure to open this Mach port (I don't know what
it is actually for, by the way) is in some way causing this extremely
rare failure.
If anyone is still reading, do you have a sense of whether I am on
the right track? Any ideas for a fix? Any ideas for how to instrument
and test to see if I can better understand what is going on?
(For the terminally curious, the application in question is "Wraith
Scheme", described on my web site, whose URL is in my .sig below, but
what's on the web site is *not* the parallel version; that one is
still under development.)
Thanks much!
-- Jay Reynolds Freeman
---------------------
email@hidden
http://web.mac.com/jay_reynolds_freeman (personal web site)