Topic/focus changed to please the moderator.

> Too bad you can't avoid blocking at least occasionally with the event-driven APIs, meaning you still have to use threads to avoid it completely. And I fail to see what's so bad about having one thread per socket. Is it because Threads Are Hard?

In this case, it is because threads are relatively expensive. Every thread adds a bit of memory use -- not insignificant given each individual thread's stack -- and scheduling overhead. In this model, you'd expect most threads to be blocked on I/O most of the time, but you'd also likely find that performance goes to hell in a handbasket as soon as multiple sockets light up with inbound data.

And, yes, threads are hard, though in this case that hardness is somewhat beside the point: the real challenge is how to get data *out* of the thread dealing with network I/O and *into* the threads dealing with the data.

Bill and others have nailed it, but I'll expand and talk about what I'd like to see Mac OS X do to support middle-ground parallelism in Snow Leopard, a la SEDA or some other mechanism. I don't really want that much, so who knows, maybe it will happen and Apple will provide an NSOperationQueueGroup class. I'll use a simplified webserver as a model.

Let's start with some back-of-the-envelope calculations.

A 3 GHz 64-bit processor on a 1 GHz backplane can move 64 bits x 1 GHz = 64 gigabits/second.

A 100 megabit link to the internet at our colocation site is 0.1 gigabits/second.

CPU load to keep the pipe to the internet full: 0.1 / 64, or about 0.16%. Except I have 8 cores, so call it 0.02% per core.

Intranet? OK, 1 gigabit/second means about 1.6% CPU load, or 0.2% per core.

Except you never see a CPU load that low. You never see it because it's much, much easier to write code like the following for a webserver:

   socket = openSocket(PORT_80);
   listen(socket);
   connection = accept(socket);
   spawnThread(socketThread(connection));


   socketThread(connection)
   {
      while (moreData)
      {
         request = read(connection);
         cachedFile = loadFileIntoCache(request);
         send(cachedFile, connection);
      }
      disconnect(connection);
   }
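
For the curious, here's roughly what that sketch expands to as real POSIX code. This is my own minimal version, not anyone's production server: error handling is mostly omitted, the HTTP reply is canned, and port 8080 stands in for port 80 (which needs root).

   #include <arpa/inet.h>
   #include <netinet/in.h>
   #include <pthread.h>
   #include <stdint.h>
   #include <string.h>
   #include <sys/socket.h>
   #include <unistd.h>

   static void *socketThread(void *arg)
   {
      int connection = (int)(intptr_t)arg;
      char request[4096];

      // One blocking read/reply loop per connection -- the "easy" model.
      while (read(connection, request, sizeof(request)) > 0)
      {
         const char *reply = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
         write(connection, reply, strlen(reply));
      }
      close(connection);
      return NULL;
   }

   int main(void)
   {
      int sock = socket(PF_INET, SOCK_STREAM, 0);
      struct sockaddr_in addr;

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port = htons(8080);              // port 80 needs root
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      bind(sock, (struct sockaddr *)&addr, sizeof(addr));
      listen(sock, 16);

      for (;;)
      {
         int connection = accept(sock, NULL, NULL);
         pthread_t tid;

         // A brand-new thread per connection: simple, but expensive at scale.
         pthread_create(&tid, NULL, socketThread, (void *)(intptr_t)connection);
         pthread_detach(tid);
      }
   }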


In fact, nearly every introductory programming book in the world tells you to do it that way. Because both halves are very linear, it's much easier to think "I need to do this, then I need to do that, and I'll wrap it in a thread so I can be doing this and that in parallel".

The problem is that creating a thread, destroying a thread, and switching threads in and out all have a LOT of overhead. The above thread is going to spend most of its time waiting for I/O, but the OS really has no way of knowing that. So the OS swaps the thread in, the thread looks at its mutex and says "I have nothing to do", and then it gets swapped out.

If you spawn a thread per connection, pretty soon your app grinds to a halt and can't get anything done.

Trivia: Apache 1.x is even worse than this: it forks off a whole new process per connection, so you have even more overhead. That's one of the reasons people tend to put caches in front of Apache for performance (webperfcache on Mac OS X Server, Squid elsewhere).

Ironically, it's the concept of threads itself that created this problem. If you look at pre-thread programming books, they all told you to write code like the following:

  socketsWithData = select(openSockets);
  for (socket : socketsWithData)
  {
     serviceSocket(socket);
  }

That is, the select call lets you pass in a whole bunch of open sockets, and it then blocks until one of them has work to do. Remember how we only needed a fraction of a percent of CPU to keep up with the network pipe? Well, you can easily do that with one thread for thousands of sockets.
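
Here's a minimal, compilable version of that loop using the real select() call. Assume something else already accepted the connections into openSockets[]; serviceSocket() is just a stand-in echo.

   #include <sys/select.h>
   #include <unistd.h>

   // Stand-in for real per-request work: select() said this fd is readable,
   // so the read() won't block.
   static void serviceSocket(int fd)
   {
      char buf[4096];
      ssize_t n = read(fd, buf, sizeof(buf));
      if (n > 0)
         write(fd, buf, n);   // echo back, standing in for real work
   }

   // openSockets[] holds every accepted connection (accepting new ones is
   // elided here). One thread services all of them.
   void serviceLoop(int *openSockets, int count)
   {
      for (;;)
      {
         fd_set readable;
         int i, maxfd = -1;

         FD_ZERO(&readable);
         for (i = 0; i < count; i++)
         {
            FD_SET(openSockets[i], &readable);
            if (openSockets[i] > maxfd)
               maxfd = openSockets[i];
         }

         // Blocks until at least one socket has work to do.
         select(maxfd + 1, &readable, NULL, NULL, NULL);

         for (i = 0; i < count; i++)
            if (FD_ISSET(openSockets[i], &readable))
               serviceSocket(openSockets[i]);
      }
   }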

The grandson of select is CFNetwork (or the non-blocking I/O classes in Java). What CFNetwork does is turn all of that into events, so you can easily service all the sockets. But as Michael says, "Too bad you can't avoid blocking at least occasionally". So what you can do to address that is split the processing of the data from the servicing of the sockets. If necessary, you can farm the processing of the data out to a thread or, even better, an NSOperationQueue.

So let's revisit our "webserver". Factoring in CFNetwork turns the network code into the following:


   // First, all of the network bookkeeping gets handled by an event model
   // in a single thread.
   socket = openSocket(PORT_80);

   switch (event)
   {
      case listen:
         listen(socket);
         break;
      case connect:
         connection = accept(socket);
         break;
      case dataIncoming:
         processData(data);
         break;
      case dataOutgoing:
         sendMoreData(data);
         break;
      case disconnect:
         disconnect(connection);
         break;
   }

   // Here, we process the incoming data and, if necessary, spawn an
   // NSOperationQueue operation to handle the request.
   void processData(data)
   {
      commandBuffer += data;               // accumulate the data on the event thread
      if (contains(commandBuffer, "\n"))   // look for a line feed
      {
         NSOperationQueue.addOperation(processRequest(commandBuffer)); // shove it on our to-do list
      }
   }

   // Loading a file or processing a request might block, so we want these
   // divided up into work queues.
   void processRequest(request)
   {
      data = dataFromCache(request);     // blocks if the file isn't loaded
      CFNetwork.send(data, connection);  // doesn't block
   }

Whereas the thread-per-socket design would probably choke at 25-50 connections, our new design can probably handle 1,000 simultaneous connections without breaking a sweat.

OK, so far, so good, so why am I kvetching exactly? Well, let's talk about the scales of parallelism available in Leopard, with educated guesses about Snow Leopard based on the press releases.

Nanoscopic: OpenMP and OpenCL let you break things out onto multiple cores at the for-loop level. I suspect Grand Central is similar, judging from the press release.

Microscopic: ?

Milliscopic: Threads, NSOperationQueue

Macroscopic: Multiple processes (see Apache)

See that gap? That's part of what I'm talking about. What I'm really looking for is either I/O-aware microthreads or smarter operation queues. To give you an example, let's look at processRequest above. dataFromCache(request) is what's doing the work, and in a functional webserver it breaks into the following code:

 NSData *dataFromCache(request)
 {
    filename = parseRequest(request);                // fastish
    NSData *data = cache.dataForFilename(filename);  // fast
    if (data)
       return data;
    else
       return cache.loadDataForFilename(filename);   // slow -- hits the disk
 }

The guy who did SEDA had two key insights. The first was that adding more threads to cache.loadDataForFilename(filename) won't make it run any faster, because it's I/O bound, not processor bound; the disk can only move so fast. The second was that it should be the operating system's job to figure this out, not the programmer's! If I go to load a file into memory, it might be cached, in which case more threads would help. As an application programmer, you don't have enough information to know what to do; even if one file isn't cached, another file might be on a different disk drive.

So with SEDA, we break our webserver into event-handling queues (called stages). Your webserver turns into the following queues:

  connect
  readData
  parseRequest
  loadFileIntoCache
  returnFileFromCache
  sendData
  disconnect

The cool thing about SEDA (you'll have to read the paper for the details) is that for each queue, SEDA figures out how many threads to allocate (or share) dynamically, based on past performance. You can give it hints up front, but basically it can determine where adding parallelism will do the most good. Unlike threads, stages don't consume many resources when idle, so you can create them freely.
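
To make "stages" concrete, here's a toy sketch of the idea on top of NSOperationQueue. The Stage class and its crude widen/narrow controller are mine, not SEDA's; real SEDA controllers are considerably smarter, but the knob they turn is the same one setMaxConcurrentOperationCount: exposes.

   #import <Foundation/Foundation.h>

   // A toy "stage": an event queue plus a thread pool whose width gets tuned
   // from observed throughput, instead of being fixed up front.
   @interface Stage : NSObject
   {
      NSOperationQueue *queue;
      double lastThroughput;
   }
   - (void)enqueue:(NSOperation *)op;
   - (void)controllerDidMeasureThroughput:(double)opsPerSecond;
   @end

   @implementation Stage

   - (id)init
   {
      if ((self = [super init]))
      {
         queue = [[NSOperationQueue alloc] init];
         [queue setMaxConcurrentOperationCount:1];   // start narrow
      }
      return self;
   }

   - (void)enqueue:(NSOperation *)op
   {
      [queue addOperation:op];
   }

   // Crude stand-in for SEDA's controller: widen the pool while extra threads
   // keep paying for themselves, narrow it when they stop.
   - (void)controllerDidMeasureThroughput:(double)opsPerSecond
   {
      NSInteger width = [queue maxConcurrentOperationCount];

      if (opsPerSecond > lastThroughput)
         [queue setMaxConcurrentOperationCount:width + 1];
      else if (width > 1)
         [queue setMaxConcurrentOperationCount:width - 1];
      lastThroughput = opsPerSecond;
   }

   @end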

Now you might think that all of this is overkill, to which I have to respond:

80 cores. Running 8x faster on an 8-core machine is a feature; running 80x faster on an 80-core machine seems like a business necessity. Plus there's the on-chip cache: you'll want to attract threads to certain cores if they're actually doing processing instead of I/O. If we're really going to use 80 cores effectively, we as application developers need to give hints to the Foundation layer that it can use, in conjunction with the kernel, to make intelligent scheduling decisions.

So I would really like to see something like SEDA in Snow Leopard. It really could be built on top of the existing NSOperation/NSOperationQueue architecture -- an NSDynamicOperationQueue or something.

Alternatives:

"Microthreads", also called "cooperative" threads. These have a bad name because most people associate them with places where true preemptive threads aren't available. But I have to say, breaking your code into an event driven model isn't that much fun. This is where microthreads come in. Many microthread libraries come with cooperative I/O libraries. The nice thing about this is that you can write to a sequential threaded code model (see the original skecth for the webserver at the top), but what happens is that when you hit some sort of I/O wait, the microthread library shuffles the thread off to the side. You end up with the best of both worlds, the simplicity of the thread coding model, with the performance of something like CFNetwork. The gotcha here of course is that you need to know when you need to spawn real threads for something that can be done in parallel. (Twisted in Python uses this model.)

NSOperationQueue has the beginnings of some smarts. It might be great to see something like an NSFileOperationQueue and an NSNetworkOperationQueue. The idea here would be that you tell NSOperationQueue you expect to be limited by I/O, and it stacks up the I/O calls appropriately. Presumably there would be queue-aware replacements for all the NSFile* calls, etc.

If NSOperation had a "yield" method that could be used in conjunction with I/O methods and NSOperationQueue, that could be the basis for I/O-aware microthreads.
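
Purely as a wish-list sketch -- none of these interfaces exist, and the names and signatures are made up -- the two ideas might look something like this:

   // None of this exists -- it's the wish, not an API.
   @interface NSFileOperationQueue : NSOperationQueue
   // Queue-aware replacement for a blocking read: the queue can see the
   // pending I/O and schedule other operations around it.
   - (void)readFileAtPath:(NSString *)path
      completionOperation:(NSOperation *)continuation;
   @end

   @interface NSOperation (IOYielding)
   // Called by queue-aware I/O methods: parks this operation without pinning
   // a thread, then reschedules it when the I/O completes.
   - (void)yield;
   @end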

 Anyways, I think this email is long enough now.

 Pierce
