Re: dispatch_apply strange results
- Subject: Re: dispatch_apply strange results
- From: Greg Parker <email@hidden>
- Date: Sun, 09 Feb 2014 00:53:11 -0800
On Feb 9, 2014, at 12:19 AM, Gerriet M. Denkmann <email@hidden> wrote:
> The real app (which I am trying to optimise) has actually two loops: one is counting, the other one is modifying. Which seems to be good news.
>
> But I would really like to understand what I should do. Trial and error (or blindly groping in the mist) is not really my preferred way of working.
Optimizing small loops like this is a black art. Very small effects become critically important, such as the alignment of your loop instructions or the associativity of that CPU's L1 cache.
Avoiding cache line contention is the first priority for this code. The array of sums is inefficient: sums for different threads sit on the same cache line, and they are written every pass through the loop. (The compiler is unlikely to optimize those writes away, because it is unlikely to recognize that the threads never read each other's sums.) Writing to a cache line requires one thread to own it to the exclusion of the others, which creates a bottleneck: significant time is spent ping-ponging ownership of the line between threads. The code would likely be faster if each thread kept its own sum in a local variable and wrote to the array of sums only at the end. That reduces cache line contention and also makes it more likely that the optimizer can keep the sum in a register, avoiding memory entirely.
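A minimal sketch of the local-variable approach, in plain C. The original code uses dispatch_apply, but for a self-contained illustration this uses pthreads directly; the names Job, worker, sum_bytes_parallel and the thread count are invented for the sketch, not taken from the original program.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4   /* illustrative; the real code would size this dynamically */

typedef struct {
    const uint8_t *data;
    size_t         len;
    uint64_t       result;   /* written exactly once, after the loop */
} Job;

static void *worker(void *arg) {
    Job *job = arg;
    uint64_t local = 0;          /* can live in a register; no shared cache line */
    for (size_t i = 0; i < job->len; i++)
        local += job->data[i];
    job->result = local;         /* the only write to shared memory */
    return NULL;
}

uint64_t sum_bytes_parallel(const uint8_t *data, size_t n) {
    pthread_t threads[NTHREADS];
    Job jobs[NTHREADS];
    size_t chunk = n / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        jobs[t].data   = data + (size_t)t * chunk;
        jobs[t].len    = (t == NTHREADS - 1) ? n - (size_t)t * chunk : chunk;
        jobs[t].result = 0;
        pthread_create(&threads[t], NULL, worker, &jobs[t]);
    }
    uint64_t total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += jobs[t].result;
    }
    return total;
}
```

The Job structs may still share cache lines with each other, but each thread writes its slot exactly once instead of once per iteration, so the contention cost is negligible.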
Working one byte at a time is almost never the efficient approach. If the compiler's autovectorizer isn't improving this code for you, then you should look into writing architecture-specific vector code, either in assembly or in compiler intrinsics, the C functions that map almost one-to-one onto vector instructions. Good vector code might be 2-8x faster than byte-at-a-time code.
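To illustrate the kind of transformation involved, here is a sketch of a vectorized byte-count using GCC/Clang vector extensions rather than true architecture-specific intrinsics; count_byte, the 16-byte lane width, and the flush interval are all assumptions for this sketch, not code from the original program.

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t v16u8 __attribute__((vector_size(16)));

size_t count_byte(const uint8_t *p, size_t n, uint8_t target) {
    size_t count = 0;
    size_t i = 0;
    v16u8 t;
    for (int j = 0; j < 16; j++) t[j] = target;

    while (i + 16 <= n) {
        /* Flush the 8-bit lane accumulators before they can overflow:
           each lane gains at most 1 per iteration, so 255 are safe. */
        size_t limit = i + 16 * 255;
        if (limit > n) limit = n;
        v16u8 acc = {0};
        for (; i + 16 <= limit; i += 16) {
            v16u8 chunk;
            memcpy(&chunk, p + i, 16);
            /* (chunk == t) is all-ones (0xFF) per matching lane;
               subtracting 0xFF adds 1 modulo 256. */
            acc -= (v16u8)(chunk == t);
        }
        for (int j = 0; j < 16; j++) count += acc[j];
    }
    for (; i < n; i++)               /* scalar tail */
        count += (p[i] == target);
    return count;
}
```

Real NEON or SSE intrinsics would look structurally similar but give you explicit control over the exact instructions emitted.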
Cache associativity can mean that some array split sizes are much worse than others. With the wrong size, each thread's working memory is on different cache lines, but those lines map to the same cache sets and evict each other. Changing the work size to avoid such collisions can help.
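To see how split sizes collide, consider an assumed example geometry (32 KB L1, 8-way associative, 64-byte lines, hence 64 sets); the real parameters vary per CPU, and this helper is purely illustrative:

```c
#include <stdint.h>

/* Assumed example geometry: 32 KB / 64-byte lines / 8 ways = 64 sets. */
enum { LINE = 64, SETS = 64, WAYS = 8 };

/* Which L1 set a given address falls into under that geometry. */
static unsigned cache_set(uintptr_t addr) {
    return (addr / LINE) % SETS;
}
```

With a per-thread chunk size of exactly 4096 bytes (LINE * SETS), the first line of every thread's chunk lands in the same set; more than WAYS threads touching those lines evict each other. Offsetting the chunk size by one line (4160 bytes here) staggers the threads across different sets.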
Measuring the bottlenecks in this sort of code is difficult. The CPU can count some of the adverse events that this sort of code needs to care about, such as branch mis-prediction and memory cache misses. The Counters and Thread States instruments can record and present some of this information to you.
--
Greg Parker email@hidden Runtime Wrangler
_______________________________________________
Cocoa-dev mailing list (email@hidden)