On Dec 20, 2005, at 4:25 PM, Ivan S. Kourtev wrote:
Thanks, Terry. That's all very good advice and I'll take it into
consideration.
So now my code knows how to find out that the system it is running
on has N > 1 processors. Say, for example, it is running on a
system with N = 2 processors so it wants to use a two-thread version
of some heavily computational routine. This is only going to be
efficient if the two threads actually use different physical
processes.
I think you meant processors?
The answer to that is actually more tricky than you'd expect. It
really comes down to your cache locality, and how independently or
interdependently your threads operate. For some applications, you
would want negaffinity - what you are describing here as wanting to
run on physically separate CPUs - while for others this would result
in significant cache busting and IPIs for extensive TLB shootdown, if
two or more threads were modifying data in the same locality. For
those, you'd want strong affinity, maybe even deciding to run them on
the same ALU in an SMT system in an idealized control-all-aspects-of-
scheduling implementation.
So here a few interesting questions arise:
1. Is there any way to specify that threads run on physically
different processors/cores? I know the OS is supposed to be smart
but maybe not that smart because the thread-creating code isn't
aware of the context of the program being computed.
The OS has several ways to do this, but they are only used internally,
not exported for use by users. The main reasoning behind this is (1)
power management makes it a requirement that we be able to bring up
and shutdown resources as necessary to meet minimum requirements for
system load and (2) there are a lot of RT and RT-like tasks that MacOS
X must support, and in order to do that, the system needs to be in
charge of resource allocation (consider binding a thread to a
processor that the system decides needs to be shut down for whatever
reason).
In addition, the central routine is not generally well protected
against shooting your foot off, hence it is not exported.
It's possible for you to write a KEXT which would allow you to call
the routine yourself (it's pretty obvious what it is, if you look
through the scheduler code on OpenDarwin.org), but it's highly
inadvisable. This is truly one place where the system probably knows
better than you. Minimally, you would need to recompile your KEXT
every point release - effectively, every software update - to keep it
from going stale, since to get at the symbol set, you'd need to link
directly instead of using a subset, which would tie you pretty tightly
to a given kernel.
2. Particularly if the answer to the above question is YES, is it
guaranteed that a thread will spend its entire life within the
physical processor it first started on? I am not terribly familiar
with the low-level stuff but, when a thread has exhausted its time
slice, could it be scheduled on a different physical processor the
next time around?
If you force the affinity, the affinity sticks until you force it
off. Period. But as I said, it's definitely not recommended, and
will likely shoot your foot off.
3. Depending on what the answers of 1. and 2. are, it seems to me it
would be useful to have a mechanism for "locking" a thread to a
processor? Sort of to make the maximum use of the available
hardware? Particularly in the case when a process wants to start M
threads where M <= N available processors.
Not really.
If you are using system frameworks, you really can't guarantee your
locality, and if you are writing your own stuff, even then, you really
can't guarantee your locality well enough for a given piece of
hardware, without tying yourself to that particular piece of hardware
forever.
Consider the case where you tune your code for 64M vs. 128M vs.
whatever of cache, or for 32 or 64 or whatever TLB slots; that code
really isn't going to run very well if you end up running it on any
other hardware and go over some local hardware limit that wasn't on
the machine you brought it up on.
Most people who use threads instead of finite state automatons as
their programming model tend to actually not get the separation
between threads strong enough that they never contend between
themselves for resources, never need to use thread IPC, never need to
use a mutex, etc.
I don't think my specific situations are very complicated but am
trying to use the opportunity to teach myself a programming style
which is certainly very new to me. My situations involve mostly
high-complexity matrix and graph computations that can be
partitioned and parallelized easily. Nothing extra fancy (like data-
races, and the necessity for communication, locking, etc.). It's
pretty safe to say that the N parallel threads can proceed full
speed ahead -- depending on what hardware is available -- I
basically need a lot of computational cycles available.
This is really the exception, rather than the rule.
Obviously, you can build a KEXT and experiment with calling it to make
the calls by proxy to establish particular CPU affinity for your
threads, but I don't think you are going to get a significant
performance win in doing so. It may be that you don't have an iTunes
or a CD burner or a DVD player or other RT task running on this system
at the same time as your calculation which could be damaged by binding
threads to particular CPUs (and you've turned down/off power
management so it doesn't interfere with your calculations); obviously,
experimenting is fine - let us know how things turn out. 8^).
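
For the easily partitioned case described above, a minimal sketch of the "start N threads and let the scheduler place them" approach might look like the following. It uses only standard pthread calls and makes no attempt to bind threads to CPUs. The names parallel_sum, sum_slice, struct slice and MAX_THREADS are hypothetical and exist only for illustration; summing an array stands in for the real matrix or graph work.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* illustrative upper bound, an assumption */

/* One contiguous slice of the input per thread; no shared writes. */
struct slice {
    const double *data;
    size_t begin, end;
    double partial;     /* written once, only by the owning thread */
};

/* Thread body: sum one slice into a local, then publish it once. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += s->data[i];
    s->partial = acc;
    return NULL;
}

/* Sums data[0..n) using nthreads threads; returns 0 on success, -1 on error. */
int parallel_sum(const double *data, size_t n, int nthreads, double *out)
{
    pthread_t tid[MAX_THREADS];
    struct slice work[MAX_THREADS];
    int started = 0, rc = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    for (int t = 0; t < nthreads; t++) {
        work[t].data = data;
        work[t].begin = n * (size_t)t / (size_t)nthreads;
        work[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        work[t].partial = 0.0;
        if (pthread_create(&tid[t], NULL, sum_slice, &work[t]) != 0) {
            rc = -1;    /* stop creating threads, but join those started */
            break;
        }
        started++;
    }

    *out = 0.0;
    for (int t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        *out += work[t].partial;
    }
    return rc;
}
```

Each thread accumulates into a local variable and writes its result exactly once, so the threads do not contend for a mutex or for cache lines while they run. Which processor each thread lands on is left entirely to the scheduler.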
Finally, I am particularly interested in knowing the answers for Mac
OS X, but it would be lovely to know what the POSIX ways (if any)
are of doing these things.
IMO, POSIX has consistently specifically gone out of its way to avoid
addressing this issue. The closest it gets to anything that could be
abused to get this behaviour is that the behaviour of the SCHED_OTHER
scheduling policy is implementation defined, and that the implementation
is permitted to define other policies, which are passed as the second
parameter to sched_setscheduler().
In effect, there's no standards-conformant way of doing this that
isn't implementation defined. I think this is because the jury is
still out on whether a program can be built to run on a general
purpose OS, and still understand the hardware and system load
characteristics enough to make effective informed decisions that are
enough of a performance win to want the APIs everywhere. If someone
was demonstrating a 50% performance improvement, you could bet some
set of APIs would be implemented everywhere in short order to take
advantage of the win.
I expect that at some point POSIX will revisit the issue, and maybe
introduce *optional* APIs that let you deal with these things; I
expect that they would likely go into sched_setscheduler() and
sched_setparam(), etc., rather than adding new entry points into the
system.
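
To make the point concrete, here is a small sketch of what the standard interfaces do expose: a thread can ask which scheduling policy it is running under, but nothing in the answer says anything about processor placement. It uses pthread_getschedparam(), which is part of POSIX threads. The helper name current_policy_name is hypothetical.

```c
#include <pthread.h>
#include <sched.h>

/* Returns a printable name for the calling thread's scheduling policy. */
const char *current_policy_name(void)
{
    int policy;
    struct sched_param param;

    if (pthread_getschedparam(pthread_self(), &policy, &param) != 0)
        return "unknown";

    switch (policy) {
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_OTHER: return "SCHED_OTHER";
    default:          return "implementation-defined";  /* vendor extension */
    }
}
```

Anything that falls into the default case is a policy the implementation added on its own, which is exactly the non-portable territory described above.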
-- Terry
Thanks for any pointers.
Cheers,
--
ivan
On Dec 19, 2005, at 10:10 PM, Terry Lambert wrote:
On Dec 18, 2005, at 9:58 PM, Eric Albert wrote:
On Dec 18, 2005, at 5:45 PM, Ivan S. Kourtev wrote:
First, I tried sysconf() but it doesn't seem capable of doing
what I need under Mac OS? The two variables
_SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN are undefined in
unistd.h -- am I missing something?
They don't seem to be defined on Mac OS X. I'd suggest filing a
bug report with Apple (<http://bugreport.apple.com>) if you'd like
to see them added.
They are non-standard extensions to the sysconf namespace. They
are unlikely to be included even if a bug report is filed, since
that particular namespace belongs to the standards committee; if
they come up with the same name that meant something else, we
wouldn't be able to implement it correctly because of binary
backward compatibility issues, so it's better if we don't add it.
The reason it's in the man page is that our man page is cribbed
from FreeBSD, and FreeBSD implements them. Our manual page there
is fairly out of date, but man page fixes are unlikely to make it
into a software update, for various reasons. The authoritative
reference is the contents of unistd.h.
The only ones you can actually use portably between platforms are
the ones defined by POSIX (assuming your other platforms are POSIX
compliant).
I also looked into sysctl as per Daniel's and Eric's suggestions
-- I noticed even the sysconf manpage suggests that the sysctl
interface is much richer. On Mac OS X, I got some code working
right away (attached at end) but I haven't figured out how to get
it to go under redhat (everything I do must work under both Mac
OS X and redhat at least). redhat has a sys/sysctl.h but it only
contains the declaration of sysctl() and none of the keywords. I
realize this may be a little off-topic, but any clues?
This sounds like a great job for a configure script. This is
hardly the only difference between Mac OS X and Linux. :) Another
alternative is to do something like
#ifdef HW_NCPU
...do the sysctl thing...
#elif defined(_SC_NPROCESSORS_CONF)
...do the sysconf thing...
#else
#error Uh oh.
#endif
sysctl is the way it should be done. As otherwise noted in this
thread already, these particular sysctl entries are generally
portable between 4.4BSD based systems.
By the way, in the code below, what are the HW_ keywords (if any)
that correspond to the commented out entries in the data[]
array? I really only need HW_NCPU and HW_AVAILCPU for now but
just out of curiosity?
Not all sysctl entries have numeric items to go along with their
names. Sometimes you just have to use sysctlbyname.
And in fact you should use names everywhere you can, rather than
OIDs, for forward code compatibility. We are likely to change
things in the future, particularly in this area of sysctl, and
sysctlbyname will be less fragile. I expect the current values
won't change (i.e. suddenly stop working for already compiled
code), but the sysctlbyname() is the preferred interface going
forward.
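
As a rough sketch of how the suggestions in this thread fit together, the following looks up the CPU count by name with sysctlbyname() on BSD-derived systems and falls back to the non-standard sysconf() extension where that is defined instead, as on Linux. The helper name cpu_count and the choice of platform macros are assumptions made for illustration; "hw.ncpu" is the by-name equivalent of HW_NCPU.

```c
#include <stddef.h>
#include <unistd.h>

#if defined(__APPLE__) || defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns the number of CPUs, or -1 if no method is available. */
int cpu_count(void)
{
#if defined(__APPLE__) || defined(__FreeBSD__)
    /* Look the value up by name rather than by numeric OID. */
    int ncpu = 0;
    size_t len = sizeof(ncpu);
    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) == 0 && ncpu > 0)
        return ncpu;
    return -1;
#elif defined(_SC_NPROCESSORS_ONLN)
    /* Non-standard sysconf extension; available on Linux. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : -1;
#else
    /* Neither interface is known to exist on this platform. */
    return -1;
#endif
}
```

A caller would typically treat a result of -1 as "unknown" and fall back to a single thread. A configure script could make the same choice at build time instead of relying on predefined platform macros.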
-- Terry
Unix-porting mailing list (email@hidden)