I recently cobbled together a cluster out of spare machines lazing
about the office.
Initially, I looked at MacMPI, which seemed ideal because of its
integration with Xgrid. I eventually rejected that option because
MacMPI doesn't allow machines to come and go from the cluster. My
cluster, affectionately named the Kludgster, is somewhat unusual in that it
includes aged prototype hardware units of every description running an
equally diverse set of Alpha OS revisions. This is not precisely a
recipe for stability. ;-)
I next looked at OpenPlay. This open source effort was originally
designed to satisfy the needs of the Macintosh gaming market, which
has high-throughput, low-latency needs akin to those of HPC. It features
dynamic discovery of hosts, so clients can come and go as they like. It
supports both TCP and UDP, and works on a variety of operating systems,
including Mac OS 9, Windows and (I think) Linux. It is several years old,
so it offers a reasonably mature networking layer to rely on. It may be a
good solution for you. I wasn't able to locate a Mach-O native version of
it. It runs as CFM Carbon -- I think the Mach-O version has been in
development and possibly released separately but I can't recall the
details. Please ask on Apple's OpenPlay list if you are interested. It
is still in active development.
Once I realized that TCP sockets were not quite as much work to get up
and running as I thought they would be, I just went with that. It took
me about a day to clusterize my previously single-machine, multithreaded,
embarrassingly parallel app. It might take you a bit longer if you are
not familiar with threading, possible endian issues, or the multistep
process of establishing a TCP connection between two machines. This
allowed me to do many of the smaller experiments that I wanted to do in
a reasonable period of time.
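For anyone who hasn't done this before, here is a rough sketch of the connection steps and the endian handling, in Python for brevity (it is not the language of my app). The names recv_exact, serve_one and demo, the 4-byte job ID and the "double it" reply are all made up for illustration. Both ends run on the loopback interface so it can be tried on one machine.

```python
import socket
import struct
import threading

def recv_exact(sock, n):
    """Read exactly n bytes; a single recv() may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def serve_one(listener, received):
    """Server side, step 4: accept one connection and answer one request."""
    conn, _addr = listener.accept()
    with conn:
        # "!I" is an unsigned 32-bit integer in network (big-endian) byte
        # order, so big- and little-endian machines agree on the value.
        (job_id,) = struct.unpack("!I", recv_exact(conn, 4))
        received.append(job_id)
        conn.sendall(struct.pack("!I", job_id * 2))  # hypothetical reply

def demo():
    # Server side, steps 1-3: create the socket, bind, listen.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    received = []
    server = threading.Thread(target=serve_one, args=(listener, received))
    server.start()

    # Client side: create a socket and connect to the listening address.
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(struct.pack("!I", 21))
        (reply,) = struct.unpack("!I", recv_exact(client, 4))

    server.join()
    listener.close()
    return received[0], reply
```

Calling demo() sends one integer each way. A real cluster would replace 127.0.0.1 with the node's address and loop on accept().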
Cluster robustness over longer time periods remained a problem,
however. For the larger jobs I want to do, I found it difficult at
first to tell the difference between machines that were just
taking a long time to finish a job and machines that had died. After
trying a variety of different approaches including various socket
timeout schemes, heartbeats and of course ignoring the problem, I
settled on the SO_KEEPALIVE socket option to boot dead cluster nodes
from the network. The only thing that bothers me about it is the 2 hr
timeout. It's a shame to have the whole cluster waiting for 2 hours for
a dead node to be laid to rest before it can continue. If this turns
out to be unsatisfactory (perhaps if the cluster grows much larger and
loses nodes at a greater rate) I might try TCP out-of-band
communication as an alternative, since this would let me set the
dead-check timeout to whatever I want.
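To make two of those approaches concrete, here is a minimal sketch in Python (again, not the language of my app). enable_keepalive turns on SO_KEEPALIVE, and wait_for_result is one simple form of socket timeout scheme. Both function names and the timeout values are hypothetical. Note that a receive timeout alone cannot tell a slow node from a dead one, which is exactly the difficulty described above.

```python
import socket

def enable_keepalive(sock):
    """Ask the OS to probe an idle connection and report a dead peer."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Read the option back to confirm it took effect.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

def wait_for_result(sock, nbytes, timeout_seconds):
    """Return received bytes, or None if nothing arrived in time.

    An empty bytes object means the peer closed the connection cleanly.
    None only means the node was silent; it may still be working.
    """
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None
```

With keepalive enabled, a blocked recv() on a connection to a dead node eventually fails with an error once the OS gives up probing, instead of hanging forever.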
It has taken me a week or two to get the cluster as a whole to perform
robustly with the individual nodes dropping at around 1/day. Most of
the time was spent doing runs for a few days until the whole
cluster died due to a collection of failures, figuring out what
happened and teaching the software to deal with that gracefully. During
that time I was mostly doing other things, so while it took a while, it
wasn't time during which I was pulling out my hair.
All in all, it has been an interesting learning experience, but I'd
recommend taking a look at OpenPlay or a similar networking layer first.
Ian