Re: dispatch_async Performance Issues
- Subject: Re: dispatch_async Performance Issues
- From: Andreas Grosam <email@hidden>
- Date: Wed, 29 Jun 2011 14:16:08 +0200
Hi Jonathan,
I'm happy that your guesses exactly match the results of my investigations. :)
Actually, the compiler (LLVM here) does an amazing job when optimizing. Although I'm happy about this, it makes such tests a pain, since often enough the compiler simply strips away the very code I wanted to test :)
So I modified the test code once more to prevent these undesired optimizations: the code now actually *uses* the future, so the compiler cannot strip away dereferencing the iterator or computing the sum in the inner loops.
After these modifications, more hints at performance issues became apparent ...
The cause is definitely the way the C++ compiler handles and generates code for possible C++ exceptions. As it happens, my iterator had a non-trivial destructor (or, more precisely, the buffer class which is a member of the iterator does), namely one which releases a CFData object. Because of this, the compiler had to generate extra cleanup code, which causes the performance difference.
After fixing the buffers list class and the iterator class, the test performs as expected. With C++ Exceptions enabled, USE_FUTURE defined, and all code fixed so that it actually uses the Future, the test runs as follows:
**** CFDataBuffers Benchmark ****
Using Future: Yes
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2
[Classic]: Elapsed time: 473.539ms
(Single threaded, naive implementation using a NSMutableData object constructed by appending many NSData buffers)
[ConcurrentProduceConsume1]: Elapsed time: 135.162ms
(GCD, straight and forward implementation, using pointers)
[ConcurrentProduceConsumeIter]: Elapsed time: 363.226ms
(GCD, old and unmodified code, using C++ Iterator concept)
[ConcurrentProduceConsumeIter2]: Elapsed time: 189.002ms
(Fixed Iterator and fixed buffers list)
As usual, take the numbers with a grain of salt. The "workload" of the consumer is minimal, and producing the buffers doesn't block, as it likely would when downloading data.
When dealing with small amounts of data (about 24 KB), the GCD approach does not perform better. In this scenario, the classic approach is faster:
Data size: 24KB, Buffer size: 8192, N = 3, C = 2
[Classic]: Elapsed time: 0.0543303ms
[ConcurrentProduceConsumeIter2]: Elapsed time: 0.116253ms
Regards, and thank you again for your help!
Andreas
On Jun 28, 2011, at 9:37 PM, Jonathan Taylor wrote:
>> In the meantime however, I found one (surprising) cause of the performance issue. After making the versions *more* equivalent, the issue became apparent. I restructured the second version (using the C++ iterators) and will discuss this in more detail. The culprit is in the consumer part, as follows:
>>
>> New restructured code:
>>
>> [...]
>>
>> The difference compared to the former code provided in the previous mail is now
>>
>> 1) The C++ instances, that is the iterators, are defined locally within the block.
>>
>> 2) The "Future" (that is, the result of the operation) is conditionally compiled in or out, in order to test its impact.
>> Here, the __block modifier is used for the "Future" variables "sum" and "total".
>> When the block instead accesses the outside variables through pointers, the performance does not differ, but using __block may be more correct.
>
>
> Ah - now then! I will take a very strong guess as to what is happening there (I've done it myself, and seen it done by plenty of others! [*]). In the case where you do NOT define USE_FUTURE, your consumer thread as written in your email does not make any use of the variables "sum_" and "total_". Hence the compiler is entirely justified in optimizing out those variables entirely! It will still have to check the iterator against eof, and may have to dereference the iterator[**], but it does not need to update the "sum_" or "total_" variables.
>
> It may well be that there is still a deeper performance issue with your original code, and I'm happy to have another look at that when I have a chance. I suggest you deal with this issue first, though, as it appears to be introducing misleading discrepancies in the execution times you're using for comparison.
>
> As I say, it's quite a common issue when you start stripping down code with the aim of doing minimal performance comparisons. The easiest solution is either to printf the results at the end (which forces the compiler to actually evaluate them!), or alternatively do the sort of thing you're doing when USE_FUTURE is defined - writing to a shadow variable at the end. If you declare your shadow variable as "volatile" then the compiler is forced to write the result and is not permitted to optimize everything out.
>
> Hope that helps, even if it may not deal with your original problem yet. Apologies that my first round of guesses were wrong - I'm pretty sure about this one though :)
>
> Jonny
>
>
> [*] Completely unrelated to this thread, but see this rather extreme example where the claimed performance had to be reduced by a factor of twelve due to this problem! http://www.ibm.com/developerworks/forums/thread.jspa?threadID=226415
> [**] I ~think~ ... because this involves a memory access, which is strictly speaking a side effect in itself.
>
_______________________________________________
Cocoa-dev mailing list (email@hidden)