Re: Poor performance of pthreads across cpu dies.
- Subject: Re: Poor performance of pthreads across cpu dies.
- From: Andrew Gallatin <email@hidden>
- Date: Tue, 4 Sep 2007 09:09:49 -0400 (EDT)
Michael Smith writes:
>
> On Aug 31, 2007, at 10:42 PM, Terry Lambert wrote:
>
> > I'd actually like to see an outside researcher demonstrate
> > benchmarks vs. this approach. Mac OS X doesn't run in virtual wire
> > mode for interrupt delivery, so interrupts are always fielded by the
> > same processor.
>
> This is not necessarily the case.
>
> On some PPC platforms, interrupts are always handled by a single CPU.
>
> On the remainder, and on x86, there are several interrupt distribution
> schemes in play. The x86 algorithm is (as I recall) roughly to prefer
> CPUs that are awake, and of those prefer the CPU with the lowest APIC
> ID that is not currently servicing an interrupt.
>
> As Drew points out, not a lot happens in interrupt context - the real
> work is left to the scheduler invoking the workloop thread - but the
> scheduler isn't terribly well off either. In the specific case of
> network data input, the scheduler would need to know which userland
> thread is currently blocked on (or will shortly read from) the socket
> to which the data that the network adapter has just received will be
> delivered. Never mind that no code has yet looked at this data, nor
> that there may be data for several sockets/threads to be delivered.
The *NIC* has looked at it, and can easily hash different connections
to several MSI-X interrupt handlers, which are then each bound to
different CPUs (or groups of CPUs).
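
As a rough illustration of the idea, here is a minimal sketch of per-connection hashing onto interrupt vectors. Everything in it is an assumption made for illustration: the vector-to-CPU layout is hypothetical, and CRC32 merely stands in for whatever hash a real NIC computes in hardware.

```python
import socket
import struct
import zlib

# Hypothetical layout: four MSI-X vectors, each bound to a pair of CPUs.
VECTOR_TO_CPUS = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]


def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Pack an IPv4/TCP 4-tuple into 12 bytes in network byte order."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def vector_for_flow(src_ip, src_port, dst_ip, dst_port):
    """Pick a vector index; CRC32 is a stand-in for the NIC's own hash."""
    h = zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port))
    return h % len(VECTOR_TO_CPUS)


def cpus_for_flow(src_ip, src_port, dst_ip, dst_port):
    """Return the CPU group that would field interrupts for this flow."""
    return VECTOR_TO_CPUS[vector_for_flow(src_ip, src_port, dst_ip, dst_port)]
```

Every packet of a given connection produces the same key, so it always lands on the same vector and hence the same group of CPUs, without any software having inspected the data first.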
> Once it has worked this out, it needs to know which cache domain
> currently contains the working set for these thread(s), and assess the
> relative cost of moving them such that it can pick a domain in which
> to run the workloop thread as well as any network-stack internal
> threads or callouts. Sadly, Apple has not yet worked out a good
> interface between the scheduler and commonly-available crystal balls,
> and so this remains a difficult thing to do.
Microsoft has, though. It is called "receive side scaling" or RSS.
The cleverness is that the NIC and the host hash the connections to the
same (sets of) CPUs.
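
For readers unfamiliar with RSS, the sketch below shows one way the two sides can agree. It assumes the Toeplitz hash that the RSS specification describes, plus an indirection table indexed by the low bits of the hash. The key and table contents are arbitrary placeholders, and the function names are hypothetical; a real driver programs its own key and table into the NIC.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: XOR a sliding 32-bit key window per set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    if len(data) * 8 > key_bits - 31:
        raise ValueError("key too short for this input")
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                # Window starts at input bit position (i * 8 + b) of the key.
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_cpu(key: bytes, table: list, flow: bytes) -> int:
    """Look up the CPU for a flow via the low bits of its hash.

    Assumes len(table) is a power of two.
    """
    return table[toeplitz_hash(key, flow) & (len(table) - 1)]


# Arbitrary placeholder values, not taken from any real device.
EXAMPLE_KEY = bytes(range(1, 41))      # 40-byte key
EXAMPLE_TABLE = [0, 1, 2, 3] * 32      # 128 entries spread over 4 CPUs
```

Because the host can evaluate the same function with the same key and table that it gave the NIC, it knows in advance which CPU will see a given connection's interrupts and can place the consuming threads accordingly.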
Even in the absence of RSS, multiple MSI-X interrupt handlers, etc.,
you can get a decent approximation of a crystal ball by having the
ability to statically bind a single interrupt and a set of hot threads
to a set of CPUs. I.e., the administrator is the crystal ball.
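
To make the static-binding idea concrete, here is a minimal sketch of what it looks like on Linux, which exposes both knobs. This is not a Darwin interface: it assumes Linux's `/proc/irq/<n>/smp_affinity_list` file and `os.sched_setaffinity`, and the helper names are hypothetical. Writing the IRQ file requires root.

```python
import os


def format_cpu_list(cpus):
    """Render a set of CPU numbers as a comma-separated list, e.g. '1,2,3'."""
    return ",".join(str(c) for c in sorted(cpus))


def irq_affinity_path(irq):
    """Path of the Linux procfs file controlling one interrupt's CPU set."""
    return f"/proc/irq/{irq}/smp_affinity_list"


def bind_irq(irq, cpus):
    """Steer one interrupt to the given CPUs (Linux only, needs root)."""
    with open(irq_affinity_path(irq), "w") as f:
        f.write(format_cpu_list(cpus))


def pin_calling_thread(cpus):
    """Restrict the calling thread to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)
```

An administrator who knows which threads consume a NIC's traffic would call `bind_irq` once for that NIC's interrupt and have each hot thread call `pin_calling_thread` with the same CPU set.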
Drew