Re: Poor performance of pthreads across cpu dies.
Michael Smith writes:
On Aug 30, 2007, at 12:04 PM, darwin-dev-request@lists.apple.com wrote:
I've seen similar issues benchmarking 10GbE NICs, and I don't even need pthreads. The scheduler tends to run the user mode application on one core, the interrupt handler kernel thread (iokit "workloop") on another, and the network stack (dlil) kernel thread on yet another.
In and of itself, this is not an issue. More of a problem is that, in many cases, the threads you have noted above don't stay in the same cache domain. This *is* an issue.
I think the fundamental problem is that the scheduler doesn't have a clue about cpu affinity, and MacOSX lacks any APIs or command line interfaces that would allow the app or admin to give it a clue (like you can on Linux, Solaris, etc).
It is naive to assume, particularly in the context of a heavily threaded system like MacOS, that thread:CPU affinity will yield significant improvements here.
It is perhaps more naive to presume that it won't. My 10GbE driver work makes me pretty familiar with the behavior under load of the network stacks in MacOSX, Linux, Solaris, and to a lesser extent Windows, all of which allow hinting. If you take another OS (say Linux) and install it on a Mac Pro, the only way to make it perform badly (like MacOSX) is to supply totally *incorrect* hints (bind the irq handler to one socket, and the benchmark to the other). And even then it does not perform as badly as MacOSX does on average :( If I then boot into OSX and apply the only sort of hinting I'm aware of -- disabling the second CPU package -- I get "reasonable" performance (2x the CPU utilization of Linux, and a hair less bandwidth). If I had a way to bind the ithread, the dlil thread, and the application to a CPU or set of CPUs, I could coerce MacOSX into getting decent performance without the drastic step of disabling a CPU core.
The critical issue, as I note above, is not CPU affinity but rather cache affinity, and more specifically data:cache affinity, since the penalty is not cache load times as the threads move around (since the caches in question are typically large enough to hold the thread working set for each thread in play), but snoop/flush/reload time as dirty data moves (slowly) from one cache domain to another as it is handed off from one thread to the next.
Indeed.
Good would be a scheduler with some notion of CPU affinity, and better would be a scheduler that allowed the user to give it some hints.
History suggests that this sort of hinting is a mixed blessing, often generating more problems than it solves (c.f. the NT stack binding vs. user comprehension for example).
It is better to enable a user with a clue to fix something than to prevent a user without a clue from shooting himself in the foot.
The real challenge involves solving, for some reduced set of circumstances, the very difficult but related questions:
o What is the system going to do next? (e.g. should I schedule a thread that has just become runnable, or wait in the hope that the current thread will block soon?)
o Who will want this data (that I have not yet examined or begun to process) next, and where are they now? (e.g. should I move this current thread to a different cache domain so that the consumer will not have to snoop it over?)
In some restricted, single-activity cases, brute-force hinting approaches can help with the above. Sadly, hinting works much less well once you have more than one source of hints, or more than one topology in play, or a topology that is not understood by the hinter; an adaptive and automatic solution is much more attractive in that case.
I don't think there's any disagreement that the current situation isn't very good; merely that the proposed 'solutions' fall fairly well short and that something better is really needed.
I'm still of the opinion that something is better than nothing. Even if it is flawed, it can't be worse than the current situation and could be beneficial. Since you guys are fond of Solaris these days (Dtrace, ZFS, etc), I encourage you to take a look at the tools and APIs that Solaris provides: psrset(1M), processor_bind(2), pset_bind(2).

Drew
participants (1)
- Andrew Gallatin