Road to 80 cores ( Was Re: Grand Central Details)
- Subject: Road to 80 cores ( Was Re: Grand Central Details)
- From: "Pierce T. Wetter III" <email@hidden>
- Date: Wed, 18 Jun 2008 10:50:53 -0700
Topic/focus changed to please the moderator.
>> Too bad you can't avoid blocking at least occasionally with the
>> event-driven APIs, meaning you still have to use threads to avoid
>> it completely. And I fail to see what's so bad about having one
>> thread per socket. Is it because Threads Are Hard?
> In that case, it is because threads are relatively expensive.
> Every thread adds a bit of memory use -- not insignificant given
> each individual thread's stack -- and scheduling overhead. In this
> model, you'd expect that most threads will be blocked on I/O most
> of the time, but you might also find that performance goes to hell
> in a handbasket as soon as multiple sockets are lit up with inbound
> data.
>
> And, yes, threads are hard, though -- in this case -- that hardness
> is a bit irrelevant in that the real challenge is how to get data
> *out* of the thread dealing with network I/O and *into* the threads
> dealing with data.
Bill and others have nailed it, but I'll expand and talk about what
I'd like to see in Mac OS X to support middle-ground parallelism in
Snow Leopard, a la SEDA or some other mechanism. I don't really want
that much, so who knows, maybe it will happen and Apple will provide
an NSOperationQueueGroup class. I'll use a simplified webserver as a
model.
Let's start with some back-of-the-envelope calculations.

A 3 GHz 64-bit processor on a 1 GHz backplane can move roughly
64 bits x 1 GHz = 64 gigabits/second.

A 100 megabit link to the internet at our colocation site is 0.1
gigabits/second.

CPU load to keep the pipe to the internet full: 0.1/64, or about
0.16% of one core. Except I have 8 cores, so make that about 0.02%
of the machine.

IntrAnet? OK, 1 gigabit/second means about 1.6% of one core.
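
Spelled out, the same arithmetic looks like this (a quick sketch; the
link rates are the assumptions from above):

    /* Back-of-the-envelope CPU cost of keeping a network pipe full. */
    #include <stdio.h>

    int main(void)
    {
        double backplane = 64.0; /* gigabits/sec: 64-bit bus at 1 GHz */
        double internet  = 0.1;  /* gigabits/sec: 100 megabit colo link */
        double intranet  = 1.0;  /* gigabits/sec */

        printf("internet: %.2f%% of one core, %.3f%% of 8 cores\n",
               100.0 * internet / backplane,
               100.0 * internet / backplane / 8.0);
        printf("intranet: %.2f%% of one core\n",
               100.0 * intranet / backplane);
        return 0;
    }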
Except you never see a CPU load that low. You never see it because
it's much, much easier to write code like the following for a
webserver:
    socket = openSocket(PORT_80);
    listen(socket);
    while (true)
    {
        connection = accept(socket);
        spawnThread(socketThread, connection);
    }

    socketThread(connection)
    {
        while (moreData)
        {
            request = read(connection);
            cachedFile = loadFileIntoCache(request);
            send(cachedFile, connection);
        }
        disconnect(connection);
    }
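
For the record, here's roughly what that model looks like as real
code -- a minimal sketch using BSD sockets and pthreads, where
handleRequest() is a hypothetical stand-in for the parse/load/send
work and error handling is omitted:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *socketThread(void *arg)
    {
        int connection = (int)(intptr_t)arg;
        char request[4096];
        /* Loop until the client hangs up, blocking on each read. */
        while (read(connection, request, sizeof(request)) > 0)
        {
            /* handleRequest(connection, request); // parse, load, send */
        }
        close(connection);
        return NULL;
    }

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));
        listen(sock, 128);
        for (;;)
        {
            /* One thread per connection: simple to read, expensive
               at scale. */
            int connection = accept(sock, NULL, NULL);
            pthread_t tid;
            pthread_create(&tid, NULL, socketThread,
                           (void *)(intptr_t)connection);
            pthread_detach(tid);
        }
    }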
In fact, nearly every introductory programming book in the world
tells you to do it that way, because the two sections are very
linear: it's much easier to think "I need to do this, then I need to
do that, and I'll wrap it in a thread so I can be doing this and
that in parallel."
The problem is that creating a thread, destroying a thread, and
switching threads in and out have a LOT of overhead. The thread
above is going to spend most of its time waiting for I/O, but the OS
has no real way of knowing that. So the OS swaps in the thread,
which looks at its mutex, says "I have nothing to do", and gets
swapped out. If you spawn a thread per connection, pretty soon your
app grinds to a halt and can't get anything done.
Trivia: Apache 1.x is even worse than this; it forks off a whole new
process per connection, so you have even more overhead. That's one
of the reasons people tend to put caches in front of Apache for
performance (webperfcache on OS X Server, squid on others).
Ironically, it's the concept of threads itself that created this
problem. If you look at pre-thread programming books, they all told
you to write code like the following:
    socketsWithData = select(openSockets);
    for (socket : socketsWithData)
    {
        serviceSocket(socket);
    }
That is, the select call lets you pass in a whole bunch of open
sockets, and it then blocks until one of them has work to do.
Remember how we only needed a fraction of a percent of one core to
keep up with the network pipe? Well, you can easily do that with one
thread for 1000s of sockets.
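
As a concrete sketch of that pre-threads style, here's a classic
select() loop; serviceSocket() is a hypothetical stand-in for the
per-socket work, and error handling is omitted:

    #include <sys/select.h>

    /* Hypothetical: reads and handles whatever is waiting on fd. */
    void serviceSocket(int fd);

    /* One thread watching many sockets: fds is an array of connected
       sockets. */
    void runLoop(int *fds, int count)
    {
        for (;;)
        {
            fd_set readable;
            int maxfd = -1;
            FD_ZERO(&readable);
            for (int i = 0; i < count; i++)
            {
                FD_SET(fds[i], &readable);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }
            /* Blocks until at least one socket has data -- no
               polling, no per-socket thread. */
            select(maxfd + 1, &readable, NULL, NULL, NULL);
            for (int i = 0; i < count; i++)
                if (FD_ISSET(fds[i], &readable))
                    serviceSocket(fds[i]);
        }
    }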
The grandson of select is CFNetwork (or the non-blocking I/O classes
in Java). What CFNetwork does is turn all of that into events, so
you can easily service all the sockets. But as Michael says, "Too
bad you can't avoid blocking at least occasionally". So what you can
then do to address that is split the processing of the data from the
servicing of the sockets. If necessary, you can farm out the
processing of the data to a thread or, even better, an
NSOperationQueue.
So if we revisit our "webserver", factoring in CFNetwork turns the
network code into the following:
    // First, all of the network bookkeeping gets handled by an event
    // model in the same thread.
    socket = openSocket(PORT_80);
    switch (event)
    {
        case listen:
            listen(socket);
            break;
        case connect:
            connection = accept(socket);
            break;
        case dataIncoming:
            processData(data);
            break;
        case dataOutgoing:
            sendMoreData(data);
            break;
        case disconnect:
            disconnect(connection);
            break;
    }
    // Here, we process the incoming data and, if necessary, spawn an
    // NSOperationQueue operation to handle the request.
    void processData(data)
    {
        commandBuffer += data;           // accumulate the data in the main thread
        if (strstr(commandBuffer, "\n")) // look for a line feed
        {
            // shove the completed request into our to-do list
            NSOperationQueue.addOperation(processRequest(commandBuffer));
        }
    }
    // Loading files or processing a request might block, so we want
    // these divided up into work queues.
    void processRequest(request)
    {
        data = dataFromCache(request);    // blocks if the file isn't loaded
        CFNetwork.send(data, connection); // doesn't block
    }
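
In real CFNetwork terms, the event dispatch above is roughly what a
CFSocket delivers to your run loop. A minimal sketch of the accept
side (handleNewConnection() is hypothetical, and error handling is
omitted):

    #include <CoreFoundation/CoreFoundation.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    static void acceptCallback(CFSocketRef s, CFSocketCallBackType type,
                               CFDataRef address, const void *data,
                               void *info)
    {
        if (type == kCFSocketAcceptCallBack)
        {
            /* For an accept callback, 'data' points at the native
               socket handle of the new connection. */
            CFSocketNativeHandle fd = *(const CFSocketNativeHandle *)data;
            /* handleNewConnection(fd); // wrap in a read stream, etc. */
        }
    }

    void startListening(void)
    {
        CFSocketRef listener =
            CFSocketCreate(kCFAllocatorDefault, PF_INET, SOCK_STREAM,
                           IPPROTO_TCP, kCFSocketAcceptCallBack,
                           acceptCallback, NULL);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_len = sizeof(addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        CFDataRef addrData = CFDataCreate(kCFAllocatorDefault,
                                          (const UInt8 *)&addr,
                                          sizeof(addr));
        CFSocketSetAddress(listener, addrData);   /* bind + listen */
        CFRelease(addrData);

        /* Accepts now arrive as run loop events on this one thread. */
        CFRunLoopSourceRef source =
            CFSocketCreateRunLoopSource(kCFAllocatorDefault, listener, 0);
        CFRunLoopAddSource(CFRunLoopGetCurrent(), source,
                           kCFRunLoopDefaultMode);
        CFRunLoopRun();
    }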
Whereas the thread-per-socket design would probably choke at 25-50
connections, our new design can probably handle 1000 simultaneous
connections without breaking a sweat.
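
And in real Cocoa, the addOperation call in processData above might
look something like this (a sketch: NSInvocationOperation and
NSOperationQueue are real Leopard API, but the handler object and
its -processRequest: selector are made-up names):

    // Hand a parsed request off to a queue instead of blocking the
    // network thread.
    NSOperationQueue *queue = [[NSOperationQueue alloc] init];

    NSInvocationOperation *op =
        [[NSInvocationOperation alloc] initWithTarget:handler
                                             selector:@selector(processRequest:)
                                               object:request];
    [queue addOperation:op];
    [op release];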
OK, so far, so good, so why am I kvetching exactly? Well, let's talk
about the scales of parallelism available in Leopard, with educated
guesses about Snow Leopard based on the press releases.

Nanoscopic: OpenMP and OpenCL let you break things out onto multiple
cores at the for-loop level (see the sketch after this list). From
the press release, I suspect Grand Central is similar.

Microscopic: ?

Milliscopic: Threads, NSOperationQueue

Macroscopic: Multiple processes (see Apache)
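
For the nanoscopic level, here's a sketch of what that looks like in
OpenMP; transform() and the arrays are placeholders:

    #include <omp.h>

    double transform(double x);   /* placeholder, defined elsewhere */

    /* One pragma splits the loop iterations across however many
       cores are available. */
    void transformAll(const double *in, double *out, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            out[i] = transform(in[i]);
    }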
See that gap? That's part of what I'm talking about. What I'm really
looking for is either I/O-aware micro-threads or smarter operation
queues. To give you an example, look at processRequest above:
dataFromCache(request) is what's doing the work, and in a functional
webserver it's going to break down into the following code:
    NSData * dataFromCache(request)
    {
        filename = parseRequest(request);               // fastish
        NSData *data = cache.dataForFilename(filename); // fast
        if (data)
            return data;
        else
            return cache.loadDataForFilename(filename); // slow
    }
The guy who did SEDA (Matt Welsh) had two key insights. The first
was that adding more threads to cache.loadDataForFilename(filename)
won't make it run any faster, because it's I/O bound, not processor
bound; the disk can only move so fast. The second was that it should
be the operating system's job to figure this out, not the
programmer's! If I go to load a file into memory, it might be
cached, so more threads would help in that case. As an application
programmer, you don't have enough information to know what to do;
even if the file isn't cached, another file might be on a different
disk drive.
So with SEDA, we break our webserver into event-handling queues
(called stages). Your webserver turns into the following queues:

Connect
ReadData
parseRequest
loadFileIntoCache
returnFileFromCache
SendData
disconnect
The cool thing about SEDA (you'll have to read the paper for the
details) is that for each queue, SEDA figures out how many threads
to allocate (or share) dynamically, based on past performance. You
can give it hints up front, but basically it can determine where
adding parallelism will do the most good. Unlike threads, stages
don't consume many resources when idle, so you can create them
freely.
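
You can fake a crude, static version of this today with one
NSOperationQueue per stage. A sketch, where the queue widths are
illustrative guesses that SEDA would instead tune dynamically from
observed throughput:

    // One queue per stage; chain the stages by having each operation
    // enqueue the next stage's work as it finishes.
    NSOperationQueue *parseQueue = [[NSOperationQueue alloc] init];
    [parseQueue setMaxConcurrentOperationCount:8]; // CPU-bound: about
                                                   // one per core

    NSOperationQueue *diskQueue = [[NSOperationQueue alloc] init];
    [diskQueue setMaxConcurrentOperationCount:2];  // I/O-bound: more
                                                   // threads won't make
                                                   // the disk any faster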
Now you might think that all this is overkill, to which I have to
respond: 80 cores. Running 8x faster on an 8-core machine is a
feature; running 80x faster on an 80-core machine seems like a
business necessity. Plus there's the on-chip cache: you'll want to
attract threads to certain cores if they're actually doing
processing instead of I/O. If we're really going to use 80 cores
effectively, we as application developers need to give hints to the
Foundation layer that it can use, in conjunction with the kernel, to
make intelligent scheduling decisions.
So I would really like to see something like SEDA in Snow Leopard.
This really could be built on top of the existing NSOperation/
NSOperationQueue architecture -- NSDynamicOperationQueue or
something.
Alternatives:
"Microthreads", also called "cooperative" threads. These have a
bad name because most people associate them with places where true
preemptive threads aren't available. But I have to say, breaking your
code into an event driven model isn't that much fun. This is where
microthreads come in. Many microthread libraries come with cooperative
I/O libraries. The nice thing about this is that you can write to a
sequential threaded code model (see the original skecth for the
webserver at the top), but what happens is that when you hit some sort
of I/O wait, the microthread library shuffles the thread off to the
side. You end up with the best of both worlds, the simplicity of the
thread coding model, with the performance of something like CFNetwork.
The gotcha here of course is that you need to know when you need to
spawn real threads for something that can be done in parallel.
(Twisted in Python uses this model.)
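
A toy sketch of that shuffling, using POSIX ucontext; the scheduler
and waitForReadable() are hypothetical scaffolding, stubbed out here
just so the idea is visible:

    #include <sys/types.h>
    #include <ucontext.h>
    #include <unistd.h>

    static ucontext_t schedulerContext; /* runs the select() loop */
    static ucontext_t currentThread;    /* the microthread being parked */

    /* Hypothetical: would register fd with the scheduler's select()
       loop so it resumes currentThread once fd is readable. */
    static void waitForReadable(int fd) { (void)fd; }

    /* Looks like a blocking read to the caller, but parks the
       microthread instead of blocking the whole thread. */
    ssize_t cooperativeRead(int fd, void *buf, size_t len)
    {
        waitForReadable(fd);
        /* Shuffle this microthread off to the side; the scheduler
           swaps us back in when fd has data, so the read below
           returns immediately. */
        swapcontext(&currentThread, &schedulerContext);
        return read(fd, buf, len);
    }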
NSOperationQueue has the beginnings of some smarts. It might be
great to see something like NSFileOperationQueue and
NSNetworkOperationQueue. The idea here would be that you tell
NSOperationQueue you expect to be limited by I/O, and it would be
able to stack up the I/O calls appropriately. Presumably, there
would be queue-aware replacements for all the NSFile commands, etc.
If NSOperation had a "yield" method that could be used in
conjunction with I/O methods and NSOperationQueue, this could be
used to develop I/O-aware micro-threads.
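
Purely as a sketch of that idea -- none of this exists in Cocoa, and
every name below is hypothetical:

    // Hypothetical: an operation that can park itself while I/O is
    // pending, handing its worker thread back to the queue.
    @interface NSIOAwareOperation : NSOperation

    // Queue-aware replacement for a blocking read: issues the I/O,
    // calls -yield, and resumes when the data has arrived.
    - (NSData *)readContentsOfFile:(NSString *)path;

    // Give the worker thread back to the queue until pending I/O fires.
    - (void)yield;

    @end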
Anyways, I think this email is long enough now.
Pierce