Mailing Lists: Apple Mailing Lists


Re: Good Message-Passing Library?




Brendan,

I recently cobbled together a cluster out of spare machines lazing about the office.

Initially, I looked at MacMPI, which seemed ideal because of its integration with Xgrid. I eventually rejected that option because MacMPI doesn't allow machines to come and go from the cluster. My cluster, affectionately named the Kludgster, is somewhat unusual in that it includes aged prototype hardware units of every description running an equally diverse set of Alpha OS revisions. This is not precisely a recipe for stability. ;-)

I next looked at OpenPlay. This open source effort was originally designed to satisfy the needs of the Macintosh gaming market, which has high-throughput, low-latency needs akin to HPC. It features dynamic discovery of hosts, and clients can come and go as they like. It supports both TCP and UDP, and works on a variety of operating systems, including Mac OS 9, Windows and (I think) Linux. It is several years old, so it offers a reasonably mature networking layer to rely on. It may be a good solution for you. I wasn't able to locate a Mach-O native version of it. It runs as CFM Carbon. I think the Mach-O version has been in development and possibly released separately, but I can't recall the details. Please ask on Apple's OpenPlay list if you are interested. It is still in active development.

    http://developer.apple.com/darwin/projects/openplay/

Once I realized that TCP sockets were not quite as much work to get up and running as I thought they would be, I just went with that. It took me about a day to clusterize my previously single-machine, multithreaded, embarrassingly parallel app. It might take you a bit longer if you are not familiar with threading, possible endian issues, or the multistep process of establishing a TCP connection between two machines. This allowed me to do many of the smaller experiments that I wanted to do in a reasonable period of time.
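To make the endian point concrete, here is a minimal sketch (not the code used on the Kludgster, which this message doesn't show) of sending length-prefixed messages over a TCP socket with every number packed in network (big-endian) byte order, so that big- and little-endian hosts agree on the bytes. The names send_msg, recv_msg, pack_doubles and unpack_doubles are hypothetical helpers invented for illustration.

```python
import socket
import struct

# 4-byte unsigned length prefix; "!" means network (big-endian) byte order.
HEADER = struct.Struct("!I")


def recv_exact(sock, n):
    """Read exactly n bytes; recv() may legitimately return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def send_msg(sock, payload):
    """Send one message: length prefix followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock):
    """Receive one message framed by send_msg()."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def pack_doubles(values):
    """Encode floats as 8-byte IEEE doubles in network byte order."""
    return struct.pack("!%dd" % len(values), *values)


def unpack_doubles(data):
    """Decode bytes produced by pack_doubles()."""
    return list(struct.unpack("!%dd" % (len(data) // 8), data))
```

A worker would call send_msg(sock, pack_doubles(results)) and the coordinator would call unpack_doubles(recv_msg(sock)). Because the byte order is fixed on the wire, neither side needs to know the other's architecture.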

Cluster robustness over longer time periods remained a problem, however. For the larger jobs I want to do, I initially found it difficult to tell the difference between machines that were just taking a long time to finish a job and machines that had died. After trying a variety of different approaches, including various socket timeout schemes, heartbeats and, of course, ignoring the problem, I settled on the SO_KEEPALIVE socket option to boot dead cluster nodes from the network. The only thing that bothers me about it is the 2-hour timeout. It's a shame to have the whole cluster waiting for 2 hours for a dead node to be laid to rest before it can continue. If this turns out to be unsatisfactory (perhaps if the cluster grows much larger and loses nodes at a greater rate), I might try TCP out-of-band communication as an alternative, since this would let me set the dead-check timeout to whatever I want.
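For reference, enabling SO_KEEPALIVE is a one-line setsockopt call. The sketch below also tries to shorten the idle time per socket, which some platforms allow through TCP-level options (TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on Mac OS X, plus TCP_KEEPINTVL and TCP_KEEPCNT where present). Whether those options are available on any given node is an assumption, so the code applies only the ones the local system exposes and reports which were set. The function name enable_keepalive and its default values are hypothetical.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive and, where supported, shorten its timers.

    Returns the names of the tuning options that were actually applied.
    """
    # Portable part: ask the kernel to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = []

    def try_set(name, value):
        opt = getattr(socket, name, None)  # absent on some platforms
        if opt is None:
            return False
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            return False
        applied.append(name)
        return True

    # Idle seconds before the first probe; the option name differs by OS.
    if not try_set("TCP_KEEPIDLE", idle_s):
        try_set("TCP_KEEPALIVE", idle_s)
    # Seconds between probes, and failed probes before the link is dropped.
    try_set("TCP_KEEPINTVL", interval_s)
    try_set("TCP_KEEPCNT", probes)
    return applied
```

With the defaults above, and assuming all three tuning options are honored, a dead peer would be detected after roughly 60 + 10 * 5 seconds of silence rather than 2 hours. On a system that exposes none of them, the call still enables plain keepalive with the system-wide default timers.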

It has taken me a week or two to get the cluster as a whole to perform robustly with individual nodes dropping at a rate of about one per day. Most of the time was spent doing runs for a few days until the whole cluster died due to a collection of failures, figuring out what happened, and teaching the software to deal with that gracefully. During that time I was mostly doing other things, so while it took a while, it wasn't time during which I was pulling out my hair.

All in all, it has been an interesting learning experience, but I'd recommend taking a look at OpenPlay or a similar networking layer first.

Ian 
Scitech mailing list      (email@hidden)

Copyright © 2007 Apple Inc. All rights reserved.