Re: StackSpace / pthread_get_stackaddr_np contention


I received the following reply. While working around this particular problem will be easy, it makes me wonder which other system calls I use, though thread-safe, might introduce contention in the future.

At 9:27 AM +0100 4/15/08, Quinn wrote:
Did StackSpace or pthread_get_stackaddr_np change recently on 8-core machines?

It looks like it changed significantly in 10.5. In 10.4 it was a simple field access.

<http://www.opensource.apple.com/darwinsource/10.4/Libc-391/pthreads/pthread.c>

In 10.5 it's doing a linear search of the threads list (which is
protected by a spin lock).

<http://www.opensource.apple.com/darwinsource/10.5/Libc-498/pthreads/pthread.c>

Not good for you I'm afraid.  This slowdown is definitely bugworthy
IMHO; while I'm sure there was a good reason for the change, such a
radical slowdown on a routine that is likely to be called often is
just bad.

As to a workaround, I suggest you call this routine once at the top
level, cache the result in a per-thread variable, and then do your
own calculations based on that and the current stack pointer.
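
A minimal sketch of that workaround, under some assumptions: the stack grows downward, the compiler supports the __thread extension (where it does not, pthread_getspecific could hold the same value), and the names stack_cache_set, stack_cache_init and cached_stack_space are made up for illustration. The slow query runs once per thread; every later check is a subtraction against the address of a local variable, with no lock involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __APPLE__
#include <pthread.h>
#endif

/* Per-thread cache of the lowest usable stack address (0 = not set yet). */
static __thread uintptr_t tls_stack_low;

/* Record the calling thread's stack bounds. stack_high is the high end of
   the stack, which is what pthread_get_stackaddr_np reports on Mac OS X. */
void stack_cache_set(void *stack_high, size_t stack_size)
{
    assert(stack_high != NULL);
    tls_stack_low = (uintptr_t)stack_high - stack_size;
}

#ifdef __APPLE__
/* Call once at the top level of each thread; this is the only place that
   pays for the (possibly locking) system query. */
void stack_cache_init(void)
{
    pthread_t self = pthread_self();
    stack_cache_set(pthread_get_stackaddr_np(self),
                    pthread_get_stacksize_np(self));
}
#endif

/* Approximate bytes between the current frame and the cached stack limit.
   Returns 0 if the cache was never set for this thread. */
size_t cached_stack_space(void)
{
    char marker;                          /* its address approximates the stack pointer */
    uintptr_t here = (uintptr_t)&marker;

    if (tls_stack_low == 0 || here <= tls_stack_low)
        return 0;
    return (size_t)(here - tls_stack_low);
}
```

Each worker thread would call stack_cache_init() once before entering the recursive code, and the inner loop would then call cached_stack_space() wherever it previously called StackSpace(). The result is only an estimate, so some headroom should be left for callee frames.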

S+E
--
Quinn "The Eskimo!"       <http://www.quinn.echidna.id.au/Quinn/WWW/>


On 4/14/08, Ron Avitzur wrote:
Hello,

Did StackSpace or pthread_get_stackaddr_np change recently on 8-core machines?

Today's benchmark shows one test running 4x slower on eight cores than on one core, rather than the expected 8x faster. The problem is embarrassingly parallelizable and coded to take advantage of that.

The time is spent waiting in __spin_lock under pthread_get_stackaddr_np, called from StackSpace(). The innermost loop is a genuinely recursive function which needs a large scratch space whose size is determined at runtime. It is allocated via alloca for speed, since most of the time it will fit on the stack, and alloca should not have any contention between cores where malloc might. Before calling alloca, however, the size is checked preflight using StackSpace, falling back on malloc if there's not enough.
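
For readers following along, the preflight-then-alloca pattern described above might look roughly like the sketch below. It is not the actual code: recurse, fits_on_stack and STACK_RESERVE are hypothetical names, the 16 KB reserve is an arbitrary choice, and the stack limit is passed in as a plain address obtained once by the caller rather than queried from the system at every level. It also assumes a downward-growing stack.

```c
#include <alloca.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Headroom kept free for callee frames; the value is an arbitrary assumption. */
#define STACK_RESERVE ((size_t)16 * 1024)

/* Lock-free preflight: would an n-byte alloca still leave STACK_RESERVE
   bytes above stack_low? */
static int fits_on_stack(uintptr_t stack_low, size_t n)
{
    char marker;                          /* its address approximates the stack pointer */
    uintptr_t here = (uintptr_t)&marker;

    if (here <= stack_low)
        return 0;
    size_t room = (size_t)(here - stack_low);
    return room > STACK_RESERVE && n <= room - STACK_RESERVE;
}

/* Hypothetical recursive worker. Each level fills n bytes of scratch space
   and returns the total number of bytes touched by it and deeper levels.
   *heap_levels counts the levels that fell back to malloc. */
size_t recurse(unsigned depth, size_t n, uintptr_t stack_low,
               unsigned *heap_levels)
{
    assert(heap_levels != NULL);
    if (depth == 0)
        return 0;

    int on_stack = fits_on_stack(stack_low, n);
    /* alloca must be called in the frame that uses the memory. */
    unsigned char *scratch = on_stack ? alloca(n) : malloc(n);
    if (scratch == NULL)
        return 0;
    if (!on_stack)
        ++*heap_levels;

    memset(scratch, 0xA5, n);             /* stand-in for the real work */
    size_t touched = n + recurse(depth - 1, n, stack_low, heap_levels);

    if (!on_stack)
        free(scratch);                    /* alloca memory is released on return */
    return touched;
}
```

Because stack_low is an ordinary value, the per-level check touches no shared state, so the cores never synchronize on it.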


Alas, StackSpace is synchronizing all the cores and killing performance. It did not use to do that.

Is there something I need to understand? Is that just a system bug? Or is there a reason I shouldn't have expected to determine the available stack space in a multi-threaded environment without incurring any synchronization overhead? What shared resource am I inadvertently serializing upon?

Do I need to avoid calling StackSpace? (I certainly do now, but if that's fixed in a system update, this won't be shipping for a while so I can wait.)

If I need to avoid it, what is the recommended fast, non-serializing way to preflight before calling alloca to ensure alloca won't blow out the stack?
This function is recursive, so keeping a persistent static store doesn't help, and performance is critical here, which is why alloca was an ideal fit.


In general, this raises the concern that even after testing and shipping a product, the addition of a lock in an obscure system call via System Update could have extreme performance consequences. I'd been spoiled for decades by regular clock speed bumps. I'd hope that, at least with embarrassingly parallelizable problems, we can look forward to years of increase in the number of cores providing similar benefit.

- Ron
PerfOptimization-dev mailing list      (email@hidden)