Re: Poor performance of pthreads across cpu dies.
- Subject: Re: Poor performance of pthreads across cpu dies.
- From: Terry Lambert <email@hidden>
- Date: Fri, 31 Aug 2007 22:42:19 -0700
On Aug 31, 2007, at 5:33 AM, Andrew Gallatin wrote:
> Michael Smith writes:
>> It is naive to assume, particularly in the context of a heavily
>> threaded system like MacOS, that thread:CPU affinity will yield
>> significant improvements here.
> It is perhaps more naive to presume that it won't.
>
> My 10GbE driver work makes me pretty familiar with the behavior
> under load of the network stacks in MacOSX, Linux, Solaris, and to
> a lesser extent Windows, all of which allow hinting.
>
> If you take another OS (say Linux) and install it on a Mac Pro, the
> only way to make it perform badly (like MacOSX) is to supply totally
> *incorrect* hints (bind the irq handler to one socket, and the
> benchmark to the other). And even then it does not perform as badly
> as MacOSX does on average :( If I then boot into OSX and apply the
> only sort of hinting I'm aware of -- disabling the second CPU
> package -- I get "reasonable" performance (2x the CPU utilization
> of Linux, and a hair less bandwidth).
>
> If I had a way to bind the ithread, the dlil thread, and the
> application to a CPU or set of CPUs, I could coerce MacOSX into
> getting decent performance without the drastic step of disabling a
> CPU core.
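For concreteness, the Linux-side knobs he's describing come down to two affinity masks: the NIC interrupt's (a hex CPU bitmask written to /proc/irq/<N>/smp_affinity) and the benchmark process's (set with sched_setaffinity(2), or per-thread with pthread_setaffinity_np(3)). Here's a rough sketch; the IRQ number and CPU ids are placeholders, and deliberately crossing the two masks is the "incorrect hints" experiment above, while using the same mask for both is the correct case:

/*
 * Rough Linux-only sketch of the two kinds of "hints" under discussion.
 * IRQ_NUM and the CPU ids are placeholders; look up the real IRQ in
 * /proc/interrupts.  Writing smp_affinity needs root.  Pointing the
 * IRQ at one package and the process at the other reproduces the
 * deliberately *incorrect* hinting; use the same mask for both to
 * get the correct case.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define IRQ_NUM 42              /* placeholder: the NIC's IRQ number */

static int set_irq_affinity(int irq, unsigned int cpu_mask)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%x\n", cpu_mask);       /* hex CPU bitmask */
    return fclose(f);
}

int main(void)
{
    cpu_set_t set;

    /* Steer the NIC interrupt to CPUs 0-1 (one package, say). */
    if (set_irq_affinity(IRQ_NUM, 0x3) != 0)
        perror("irq affinity");

    /* Bind this process to CPUs 2-3 (the other package). */
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* ... run the benchmark workload here ... */
    return 0;
}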
I'd actually like to see an outside researcher demonstrate benchmarks
vs. this approach. Mac OS X doesn't run in virtual wire mode for
interrupt delivery, so interrupts are always fielded by the same
processor. You could maybe do something like the "poll mode" driver
in FreeBSD or like Rice University's experimental LRP implementation
in NetBSD. Up to a point. But there's no way you could take it from
the interrupt all the way to the top of the stack, without taking the
I/O Kit model out of the picture (and the IOKit abstraction model buys
you a heck of a lot). So on average, you will still take 1.5 CPU
transitions for any workload on a 4 CPU system, even if you had a per-
CPU dlil thread.
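(One way to arrive at a number like that: assume the interrupt is always fielded on one fixed CPU, and that without any classification by destination the CPU running the dlil work and the CPU running the application are each effectively uncorrelated random picks from the 4. Then each of the two hand-offs, interrupt to dlil and dlil to application, crosses CPUs with probability 3/4, for an expected 2 x 3/4 = 1.5 cross-CPU transitions.)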
Most of the research in this area actually points to layering
violations at a very low level for specific protocols to force
classification of the data by destination process/thread. The
limitation on the supported protocols "undoes" the layering
violation. For systems supporting multiple CPUs and a large network device
buffer, this goes even further to tagging the descriptor by CPU, and
managing the connections as "named streams", where the interrupt is
sent (wherever).
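To make that concrete, here's a toy sketch of the classification step (purely illustrative, not the MacOSX dlil or any shipping stack): the low-level receive path peeks at the connection 4-tuple, hashes it, and stamps the packet with the CPU whose queue, and ideally whose consumer thread, owns that flow. Hardware that tags descriptors by CPU does essentially the same thing with an RSS-style Toeplitz hash feeding per-CPU receive rings.

/*
 * Toy illustration only: not the MacOSX dlil and not any shipping
 * stack.  Classify each received packet by a hash of its connection
 * 4-tuple and stamp it with the CPU that "owns" that flow, so the
 * rest of the processing can stay on the consumer's CPU.
 */
#include <stdint.h>
#include <stdio.h>

#define NCPUS 4

struct pkt {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    int      owner_cpu;         /* filled in by classify() */
};

/* Cheap 4-tuple hash; real hardware uses something like Toeplitz. */
static uint32_t flow_hash(const struct pkt *p)
{
    uint32_t h = p->src_ip ^ p->dst_ip;
    h ^= ((uint32_t)p->src_port << 16) | p->dst_port;
    h ^= h >> 16;
    return h;
}

/* The "layering violation": the low level peeks at transport-layer
 * ports so the packet can be queued toward its consumer's CPU.     */
static void classify(struct pkt *p)
{
    p->owner_cpu = (int)(flow_hash(p) % NCPUS);
}

int main(void)
{
    struct pkt p = { 0x0a000001, 0x0a000002, 12345, 80, -1 };
    classify(&p);
    printf("flow -> CPU %d\n", p.owner_cpu);
    return 0;
}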
> Since you guys are fond of Solaris these days (Dtrace, ZFS, etc), I
> encourage you to take a look at the tools and APIs that Solaris
> provides (psrset(1M), processor_bind(2), pset_bind(2)).
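For anyone who hasn't used those, here's a minimal Solaris sketch of processor_bind(2); it binds the calling LWP to a single processor (id 0 here, assuming that CPU id exists and is online), while pset_bind(2) and psrset(1M) do the analogous thing with processor sets rather than individual CPUs:

/* Minimal Solaris sketch: bind the calling LWP to processor 0.
 * Assumes processor id 0 exists and is online; error handling is
 * kept to the basics.                                              */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int main(void)
{
    processorid_t old_binding;

    /* P_LWPID + P_MYID means "the calling LWP"; 0 is the target CPU. */
    if (processor_bind(P_LWPID, P_MYID, 0, &old_binding) != 0) {
        perror("processor_bind");
        return 1;
    }
    printf("previous binding: %d\n", (int)old_binding);
    return 0;
}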
IMO, the HP/IBM affinity interfaces are actually more likely to make
it through the standards process (they are already cross-vendor). It
would be a serious shame to implement one model, and then end up
having to throw it away because it wasn't the standard model, while
having to maintain both because someone wrote some important
application to the old one.
-- Terry