Re: Poor performance of pthreads across cpu dies.
- Subject: Re: Poor performance of pthreads across cpu dies.
- From: Andrew Gallatin <email@hidden>
- Date: Sat, 1 Sep 2007 10:09:59 -0400 (EDT)
Terry Lambert writes:
> On Aug 31, 2007, at 5:33 AM, Andrew Gallatin wrote:
> > Michael Smith writes:
> >> It is naive to assume, particularly in the context of a heavily
> >> threaded system like MacOS, that thread:CPU affinity will yield
> >> significant improvements here.
> >
> > It is perhaps more naive to presume that it won't.
> >
> > My 10GbE driver work makes me pretty familiar with the behavior
> > under load of the network stacks in MacOSX, Linux, Solaris,
> > and to a lesser extent Windows, all of which allow hinting.
> >
> > If you take another OS (say Linux), and install it on a Mac Pro, the
> > only way to make it perform badly (like MacOSX) is to supply totally
> > *incorrect* hints (bind the irq handler to one socket, and the
> > benchmark to the other). And even then it does not perform as badly
> > as MacOSX does on average :( If I then boot into OSX and apply the
> > only sort of hinting I'm aware of -- disabling the second CPU package
> > -- I get "reasonable" performance (2x the CPU utilization of Linux,
> > and a hair less bandwidth).
> >
> > If I had a way to bind the ithread, the dlil thread, and
> > the application to a CPU or set of CPUs, I could coerce
> > MacOSX into getting decent performance without the drastic step
> > of disabling a CPU core.
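To be concrete, this is the sort of hinting I mean. On Linux the
process/thread side is a few lines with sched_setaffinity(2); a
minimal sketch, with the CPU number purely illustrative:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);        /* pin to CPU 0 (illustrative) */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the benchmark loop here ... */
    return 0;
}

The irq side is just a cpu bitmask echoed into
/proc/irq/<N>/smp_affinity.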
>
> I'd actually like to see an outside researcher demonstrate benchmarks
> vs. this approach. Mac OS X doesn't run in virtual wire mode for
> interrupt delivery, so interrupts are always fielded by the same
> processor.
Yes, and no. The hardware interrupts are fielded by the same CPU,
but I'm pretty sure the ithread, err, workloop can be scheduled
anywhere. For an IOKit network driver, the hardware interrupt
context is nearly useless; you have to be running in thread
context to pass packets up the stack.
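For reference, this is the usual shape of it in a driver. A sketch
only -- the class and method names are made up, and the structors
boilerplate is omitted -- but the event-source pattern is the
standard one:

#include <IOKit/IOService.h>
#include <IOKit/IOWorkLoop.h>
#include <IOKit/IOInterruptEventSource.h>

class MyEthernet : public IOService {
public:
    bool wireInterrupt(IOService *provider);
    void handleInterrupt(IOInterruptEventSource *src, int count);
};

bool MyEthernet::wireInterrupt(IOService *provider)
{
    /* No real work happens in primary interrupt context; the driver
     * registers an event source and everything runs on the
     * work-loop thread (assumes the driver already has a work loop). */
    IOInterruptEventSource *src =
        IOInterruptEventSource::interruptEventSource(
            this,
            OSMemberFunctionCast(IOInterruptEventSource::Action,
                                 this, &MyEthernet::handleInterrupt),
            provider);
    if (src == NULL || getWorkLoop() == NULL)
        return false;
    return getWorkLoop()->addEventSource(src) == kIOReturnSuccess;
}

void MyEthernet::handleInterrupt(IOInterruptEventSource *, int)
{
    /* Work-loop thread context: this is where packets go up the
     * stack, and nothing pins this thread to the CPU that fielded
     * the hardware interrupt. */
}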
> You could maybe do something like the "poll mode" driver
> in FreeBSD or like Rice University's experimental LRP implementation
> in NetBSD. Up to a point. But there's no way you could take it from
> the interrupt all the way to the top of the stack, without taking the
> I/O Kit model out of the picture (and the IOKit abstraction model buys
> you a heck of a lot). So on average, you will still take 1.5 CPU
> transitions for any workload on a 4 CPU system, even if you had a per-
> CPU dlil thread.
OSes with decent driver models (Solaris, FreeBSD, Linux) have no
problem running from the interrupt (thread) all the way to the top of
the stack and back down into the driver. So I assume that IOKit
can't do this because of some deadlock around the workloop lock?
Is it that IOKit implicitly acquires the workloop lock on the way
to a transmit routine, or do some drivers explicitly run their
transmit routine in the workloop? Or is it some other deadlock?
BTW, somebody needs to fix dlil and make the queue bounded.
In a livelock situation, you can end up with all the mbuf
clusters on the system sitting on the dlil input queue.
The net.inet.ip.intr_queue_maxlen sysctl is a knob connected to
nothing.
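What I have in mind is just the classic BSD bounded-ifqueue
pattern. A sketch, not the actual dlil code -- the IF_* macros are
the classic BSD ones, and dlil_input_wakeup() is a made-up
stand-in, not an xnu symbol:

#include <sys/mbuf.h>      /* struct mbuf, m_freem() */
#include <net/if_var.h>    /* struct ifqueue, IF_* macros */

extern void dlil_input_wakeup(void);   /* hypothetical */

static void
bounded_input_enqueue(struct ifqueue *inq, struct mbuf *m)
{
    if (IF_QFULL(inq)) {
        IF_DROP(inq);      /* count the drop ...            */
        m_freem(m);        /* ... and give the cluster back */
    } else {
        IF_ENQUEUE(inq, m);
        dlil_input_wakeup();
    }
}

With something like that in place, a livelocked box drops packets at
the driver edge instead of hoarding every cluster on the input queue.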
> Most of the research in this area actually points to layering
> violations at a very low level for specific protocols to force
> classification of the data by destination process/thread. The
> limitation on the supported protocols "undoes" the layering
> violation. For systems supporting multiCPU and a large network device
> buffer, this goes even further to tagging the descriptor by CPU, and
> managing the connections as "named streams", where the interrupt is
> sent (wherever).
>
> > Since you guys are fond of Solaris these days (Dtrace, ZFS, etc), I
> > encourage you to take a look at the tools and APIs that Solaris
> > provides (psrset(1M), processor_bind(2), pset_bind(2)).
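For what it's worth, the usage is about as simple as it gets; a
sketch, with the processor id purely illustrative:

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int main(void)
{
    /* Bind the calling LWP to processor 0; passing PBIND_NONE
     * instead of 0 would clear the binding again. */
    if (processor_bind(P_LWPID, P_MYID, 0, NULL) != 0) {
        perror("processor_bind");
        return 1;
    }
    return 0;
}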
>
> IMO, the HP/IBM affinity interfaces are actually more likely to make
> it through the standards process (they are already cross-vendor). It
> would be a serious shame to implement one model, and then end up
> having to throw it away because it wasn't the standard model, while
> having to maintain both because someone wrote some important
> application to the old one.
Well, please just pick one and go with it. :) The more momentum it
has behind it, the more likely it is to be a standard, and
anything is better than nothing.
Drew