Road to 80 cores ( Was Re: Grand Central Details)
- Subject: Road to 80 cores ( Was Re: Grand Central Details)
- From: "Pierce T. Wetter III" <email@hidden>
- Date: Wed, 18 Jun 2008 10:50:53 -0700
Topic/focus changed to please the moderator.
>> Too bad you can't avoid blocking at least occasionally with the
>> event-driven APIs, meaning you still have to use threads to avoid
>> it completely. And I fail to see what's so bad about having one
>> thread per socket. Is it because Threads Are Hard?
> In that case, it is because threads are relatively expensive.
> Every thread adds a bit of memory use -- not insignificant given
> each individual thread's stack -- and scheduling overhead. In this
> model, you'd expect that most threads will be blocked on I/O most
> of the time, but you might also find that performance goes to hell
> in a handbasket as soon as multiple sockets are lit up with inbound
> data.
>
> And, yes, threads are hard, though -- in this case -- that hardness
> is a bit irrelevant in that the real challenge is how to get data
> *out* of the thread dealing with network I/O and *into* the threads
> dealing with data.
Bill and others have nailed it, but I'll expand and talk about what
I'd like to see in Mac OS X to support middle-ground parallelism in
Snow Leopard, a la SEDA or some other mechanism. I don't really want
that much, so who knows, maybe it will happen and Apple will provide
an NSOperationQueueGroup class. I'll use a simplified webserver as a
model.
Let's start with some back-of-the-envelope calculations.

A 3 GHz 64-bit processor on a 1 GHz backplane can move roughly
64 bits x 1 GHz = 64 gigabits/second.

A 100 megabit link to the internet at our colocation site is 0.1
gigabits/second.

CPU load to keep the pipe to the internet full: 0.1/64, or about
0.16% of one core. Except I have 8 cores, so make that about 0.02%
of the machine.

IntrAnet? OK, 1 gigabit/second means about 1.6% of one core.
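
Spelled out, the same arithmetic looks like this (a quick sketch; the
link rates are the assumptions from above):

    /* Back-of-the-envelope CPU cost of keeping a network pipe full. */
    #include <stdio.h>

    int main(void)
    {
        double backplane = 64.0; /* gigabits/sec: 64-bit bus at 1 GHz */
        double internet  = 0.1;  /* gigabits/sec: 100 megabit colo link */
        double intranet  = 1.0;  /* gigabits/sec */

        printf("internet: %.2f%% of one core, %.3f%% of 8 cores\n",
               100.0 * internet / backplane,
               100.0 * internet / backplane / 8.0);
        printf("intranet: %.2f%% of one core\n",
               100.0 * intranet / backplane);
        return 0;
    }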
Except you never see a CPU load that low. You never see it because
it's much, much easier to write code like the following for a
webserver:
    socket = openSocket(PORT_80);
    listen(socket);
    while (true)
    {
        connection = accept(socket);
        spawnThread(socketThread, connection);
    }

    socketThread(connection)
    {
        while (moreData)
        {
            request = read(connection);
            cachedFile = loadFileIntoCache(request);
            send(cachedFile, connection);
        }
        disconnect(connection);
    }
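
For the record, here's roughly what that model looks like as real
code -- a minimal sketch using BSD sockets and pthreads, where
handleRequest() is a hypothetical stand-in for the parse/load/send
work and error handling is omitted:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *socketThread(void *arg)
    {
        int connection = (int)(intptr_t)arg;
        char request[4096];
        /* Loop until the client hangs up, blocking on each read. */
        while (read(connection, request, sizeof(request)) > 0)
        {
            /* handleRequest(connection, request); // parse, load, send */
        }
        close(connection);
        return NULL;
    }

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));
        listen(sock, 128);
        for (;;)
        {
            /* One thread per connection: simple to read, expensive
               at scale. */
            int connection = accept(sock, NULL, NULL);
            pthread_t tid;
            pthread_create(&tid, NULL, socketThread,
                           (void *)(intptr_t)connection);
            pthread_detach(tid);
        }
    }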
In fact, nearly every introductory programming book in the world
tells you to do it that way, because the two sections are very
linear: it's much easier to think "I need to do this, then I need to
do that, and I'll wrap it in a thread so I can be doing this and
that in parallel."
The problem is that creating a thread, destroying a thread, and
switching threads in and out have a LOT of overhead. The thread
above is going to spend most of its time waiting for I/O, but the OS
has no real way of knowing that. So the OS swaps in the thread,
which looks at its mutex, says "I have nothing to do", and gets
swapped out. If you spawn a thread per connection, pretty soon your
app grinds to a halt and can't get anything done.
Trivia: Apache 1.x is even worse than this; it forks off a whole new
process per connection, so you have even more overhead. That's one
of the reasons people tend to put caches in front of Apache for
performance (webperfcache on OS X Server, squid on others).
Ironically, it's the concept of threads itself that created this
problem. If you look at pre-thread programming books, they all told
you to write code like the following:
    socketsWithData = select(openSockets);
    for (socket : socketsWithData)
    {
        serviceSocket(socket);
    }
That is, the select call lets you pass in a whole bunch of open
sockets, and it then blocks until one of them has work to do.
Remember how we only needed a fraction of a percent of one core to
keep up with the network pipe? Well, you can easily do that with one
thread for 1000s of sockets.
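
As a concrete sketch of that pre-threads style, here's a classic
select() loop; serviceSocket() is a hypothetical stand-in for the
per-socket work, and error handling is omitted:

    #include <sys/select.h>

    /* Hypothetical: reads and handles whatever is waiting on fd. */
    void serviceSocket(int fd);

    /* One thread watching many sockets: fds is an array of connected
       sockets. */
    void runLoop(int *fds, int count)
    {
        for (;;)
        {
            fd_set readable;
            int maxfd = -1;
            FD_ZERO(&readable);
            for (int i = 0; i < count; i++)
            {
                FD_SET(fds[i], &readable);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }
            /* Blocks until at least one socket has data -- no
               polling, no per-socket thread. */
            select(maxfd + 1, &readable, NULL, NULL, NULL);
            for (int i = 0; i < count; i++)
                if (FD_ISSET(fds[i], &readable))
                    serviceSocket(fds[i]);
        }
    }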
The grandson of select is CFNetwork (or the non-blocking I/O classes
in Java). What CFNetwork does is turn all of that into events, so
you can easily service all the sockets. But as Michael says, "Too
bad you can't avoid blocking at least occasionally". So what you can
then do to address that is split the processing of the data from the
servicing of the sockets. If necessary, you can farm out the
processing of the data to a thread or, even better, an
NSOperationQueue.
So if we revisit our "webserver", factoring in CFNetwork turns the
network code into the following:
    // First, all of the network bookkeeping gets handled by an event
    // model in the same thread.
    socket = openSocket(PORT_80);
    switch (event)
    {
        case listen:
            listen(socket);
            break;
        case connect:
            connection = accept(socket);
            break;
        case dataIncoming:
            processData(data);
            break;
        case dataOutgoing:
            sendMoreData(data);
            break;
        case disconnect:
            disconnect(connection);
            break;
    }
    // Here, we process the incoming data and, if necessary, spawn an
    // NSOperationQueue operation to handle the request.
    void processData(data)
    {
        commandBuffer += data;           // accumulate the data in the main thread
        if (strstr(commandBuffer, "\n")) // look for a line feed
        {
            // shove the completed request into our to-do list
            NSOperationQueue.addOperation(processRequest(commandBuffer));
        }
    }
    // Loading files or processing a request might block, so we want
    // these divided up into work queues.
    void processRequest(request)
    {
        data = dataFromCache(request);    // blocks if the file isn't loaded
        CFNetwork.send(data, connection); // doesn't block
    }
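
In real CFNetwork terms, the event dispatch above is roughly what a
CFSocket delivers to your run loop. A minimal sketch of the accept
side (handleNewConnection() is hypothetical, and error handling is
omitted):

    #include <CoreFoundation/CoreFoundation.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    static void acceptCallback(CFSocketRef s, CFSocketCallBackType type,
                               CFDataRef address, const void *data,
                               void *info)
    {
        if (type == kCFSocketAcceptCallBack)
        {
            /* For an accept callback, 'data' points at the native
               socket handle of the new connection. */
            CFSocketNativeHandle fd = *(const CFSocketNativeHandle *)data;
            /* handleNewConnection(fd); // wrap in a read stream, etc. */
        }
    }

    void startListening(void)
    {
        CFSocketRef listener =
            CFSocketCreate(kCFAllocatorDefault, PF_INET, SOCK_STREAM,
                           IPPROTO_TCP, kCFSocketAcceptCallBack,
                           acceptCallback, NULL);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_len = sizeof(addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        CFDataRef addrData = CFDataCreate(kCFAllocatorDefault,
                                          (const UInt8 *)&addr,
                                          sizeof(addr));
        CFSocketSetAddress(listener, addrData);   /* bind + listen */
        CFRelease(addrData);

        /* Accepts now arrive as run loop events on this one thread. */
        CFRunLoopSourceRef source =
            CFSocketCreateRunLoopSource(kCFAllocatorDefault, listener, 0);
        CFRunLoopAddSource(CFRunLoopGetCurrent(), source,
                           kCFRunLoopDefaultMode);
        CFRunLoopRun();
    }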
Whereas the thread-per-socket design would probably choke at 25-50
connections, our new design can probably handle 1000 simultaneous
connections without breaking a sweat.
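
And in real Cocoa, the addOperation call in processData above might
look something like this (a sketch: NSInvocationOperation and
NSOperationQueue are real Leopard API, but the handler object and
its -processRequest: selector are made-up names):

    // Hand a parsed request off to a queue instead of blocking the
    // network thread.
    NSOperationQueue *queue = [[NSOperationQueue alloc] init];

    NSInvocationOperation *op =
        [[NSInvocationOperation alloc] initWithTarget:handler
                                             selector:@selector(processRequest:)
                                               object:request];
    [queue addOperation:op];
    [op release];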
OK, so far, so good, so why am I kvetching exactly? Well, let's talk
about the scales of parallelism available in Leopard, with educated
guesses about Snow Leopard based on the press releases.

Nanoscopic: OpenMP and OpenCL let you break things out onto multiple
cores at the for-loop level (see the sketch after this list). From
the press release, I suspect Grand Central is similar.

Microscopic: ?

Milliscopic: Threads, NSOperationQueue

Macroscopic: Multiple processes (see Apache)
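
For the nanoscopic level, here's a sketch of what that looks like in
OpenMP; transform() and the arrays are placeholders:

    #include <omp.h>

    double transform(double x);   /* placeholder, defined elsewhere */

    /* One pragma splits the loop iterations across however many
       cores are available. */
    void transformAll(const double *in, double *out, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            out[i] = transform(in[i]);
    }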
See that gap? That's part of what I'm talking about. What I'm really
looking for is either I/O-aware micro-threads or smarter operation
queues. To give you an example, look at processRequest above:
dataFromCache(request) is what's doing the work, and in a functional
webserver it's going to break down into the following code:
    NSData * dataFromCache(request)
    {
        filename = parseRequest(request);               // fastish
        NSData *data = cache.dataForFilename(filename); // fast
        if (data)
            return data;
        else
            return cache.loadDataForFilename(filename); // slow
    }
The guy who did SEDA (Matt Welsh) had two key insights. The first
was that adding more threads to cache.loadDataForFilename(filename)
won't make it run any faster, because it's I/O bound, not processor
bound; the disk can only move so fast. The second was that it should
be the operating system's job to figure this out, not the
programmer's! If I go to load a file into memory, it might be
cached, so more threads would help in that case. As an application
programmer, you don't have enough information to know what to do;
even if the file isn't cached, another file might be on a different
disk drive.
So with SEDA, we break our webserver into event-handling queues
(called stages). Your webserver turns into the following queues:

Connect
ReadData
parseRequest
loadFileIntoCache
returnFileFromCache
SendData
disconnect
The cool thing about SEDA (you'll have to read the paper for the
details) is that for each queue, SEDA figures out how many threads
to allocate (or share) dynamically, based on past performance. You
can give it hints up front, but basically it can determine where
adding parallelism will do the most good. Unlike threads, stages
don't consume many resources when idle, so you can create them
freely.
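
You can fake a crude, static version of this today with one
NSOperationQueue per stage. A sketch, where the queue widths are
illustrative guesses that SEDA would instead tune dynamically from
observed throughput:

    // One queue per stage; chain the stages by having each operation
    // enqueue the next stage's work as it finishes.
    NSOperationQueue *parseQueue = [[NSOperationQueue alloc] init];
    [parseQueue setMaxConcurrentOperationCount:8]; // CPU-bound: about
                                                   // one per core

    NSOperationQueue *diskQueue = [[NSOperationQueue alloc] init];
    [diskQueue setMaxConcurrentOperationCount:2];  // I/O-bound: more
                                                   // threads won't make
                                                   // the disk any faster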
Now you might think that all this is overkill, to which I have to
respond: 80 cores. Running 8x faster on an 8-core machine is a
feature; running 80x faster on an 80-core machine seems like a
business necessity. Plus there's the on-chip cache: you'll want to
attract threads to certain cores if they're actually doing
processing instead of I/O. If we're really going to use 80 cores
effectively, we as application developers need to give hints to the
Foundation layer that it can use, in conjunction with the kernel, to
make intelligent scheduling decisions.
So I would really like to see something like SEDA in Snow Leopard.
This really could be built on top of the existing NSOperation/
NSOperationQueue architecture -- NSDynamicOperationQueue or
something.
Alternatives:
"Microthreads", also called "cooperative" threads. These have a
bad name because most people associate them with places where true
preemptive threads aren't available. But I have to say, breaking your
code into an event driven model isn't that much fun. This is where
microthreads come in. Many microthread libraries come with cooperative
I/O libraries. The nice thing about this is that you can write to a
sequential threaded code model (see the original skecth for the
webserver at the top), but what happens is that when you hit some sort
of I/O wait, the microthread library shuffles the thread off to the
side. You end up with the best of both worlds, the simplicity of the
thread coding model, with the performance of something like CFNetwork.
The gotcha here of course is that you need to know when you need to
spawn real threads for something that can be done in parallel.
(Twisted in Python uses this model.)
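
A toy sketch of that shuffling, using POSIX ucontext; the scheduler
and waitForReadable() are hypothetical scaffolding, stubbed out here
just so the idea is visible:

    #include <sys/types.h>
    #include <ucontext.h>
    #include <unistd.h>

    static ucontext_t schedulerContext; /* runs the select() loop */
    static ucontext_t currentThread;    /* the microthread being parked */

    /* Hypothetical: would register fd with the scheduler's select()
       loop so it resumes currentThread once fd is readable. */
    static void waitForReadable(int fd) { (void)fd; }

    /* Looks like a blocking read to the caller, but parks the
       microthread instead of blocking the whole thread. */
    ssize_t cooperativeRead(int fd, void *buf, size_t len)
    {
        waitForReadable(fd);
        /* Shuffle this microthread off to the side; the scheduler
           swaps us back in when fd has data, so the read below
           returns immediately. */
        swapcontext(&currentThread, &schedulerContext);
        return read(fd, buf, len);
    }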
NSOperationQueue has the beginnings of some smarts. It might be
great to see something like NSFileOperationQueue and
NSNetworkOperationQueue. The idea here would be that you tell
NSOperationQueue you expect to be limited by I/O, and it would be
able to stack up the I/O calls appropriately. Presumably, there
would be queue-aware replacements for all the NSFile commands, etc.
If NSOperation had a "yield" method that could be used in
conjunction with I/O methods and NSOperationQueue, this could be
used to develop I/O-aware micro-threads.
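
Purely as a sketch of that idea -- none of this exists in Cocoa, and
every name below is hypothetical:

    // Hypothetical: an operation that can park itself while I/O is
    // pending, handing its worker thread back to the queue.
    @interface NSIOAwareOperation : NSOperation

    // Queue-aware replacement for a blocking read: issues the I/O,
    // calls -yield, and resumes when the data has arrived.
    - (NSData *)readContentsOfFile:(NSString *)path;

    // Give the worker thread back to the queue until pending I/O fires.
    - (void)yield;

    @end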
Anyways, I think this email is long enough now.
Pierce