Re: dispatch_async Performance Issues
- Subject: Re: dispatch_async Performance Issues
- From: Andreas Grosam <email@hidden>
- Date: Tue, 28 Jun 2011 19:22:51 +0200
On Jun 28, 2011, at 3:53 PM, Jonathan Taylor wrote:
Hi Jonathan,
Thank you very much for taking the time to look at it.
> Hi Andreas,
>
> If I understand your post correctly, you are saying that you see a performance drop of 3x when using an iterator in your inner loop as opposed to using hand-written C code to do the iterating.
Yes, at least that was my former assumption.
In the meantime, however, I found one (surprising) cause of the performance issue. After making the versions *more* equivalent, the issue became apparent. I restructured the second version (the one using the C++ iterators) and will discuss it in more detail. The culprit is in the consumer part, as follows:
New restructured code:
...
#if defined (USE_FUTURE)
    __block size_t sum = 0;
    __block size_t total = 0;
#endif
    dispatch_async(queue, ^{
        CFDataConstBuffers_iterator<char> eof;
        CFDataConstBuffers_iterator<char> iter(*buffersPtr);
        size_t sum_ = 0;
        size_t total_ = 0;
        while (iter != eof)
        {
            sum_ += *iter;
            ++iter;
            ++total_;
        }
#if defined (USE_FUTURE)
        sum = sum_;      // write the __block "futures" only once
        total = total_;
#endif
        semPtr->signal();
    });
...
The difference compared to the former code provided in the previous mail is now
1) The C++ objects, that is the iterators, are now defined locally within the block.
2) The "future" (that is, the result of the operation) is conditionally compiled in or out, in order to test its impact.
Here, the __block modifier is used for the "future" variables "sum" and "total".
When pointers are used within the block to access the outside variables instead, the performance does not differ, but using __block seems more correct.
Note that I access the variables sum and total only once, at the end of the block. There is a reason for this, which I will explain below.
As mentioned, the conditional #if defined (USE_FUTURE) tests the future's impact on performance.
If USE_FUTURE is defined, the performance drops dramatically!
The same happens if pointer variables are used to access variables defined outside the block.
The performance drops even more if I use the __block variables "sum" and "total" directly when incrementing, e.g.:
    while (iter != eof)
    {
        sum += *iter;
        ++iter;
        ++total;
    }
Even when I access the Future "sum" and "total" only once in the block, the performance penalty is significant.
> Unfortunately you say you haven't actually posted the code relating to the iterator... but it seems to me that this is very likely to be the source of your problems!
OK, I've appended the source code. :)
It is three files, and a lot for this mail.
But I don't think this is the cause of the issue.
>
> Your title states "dispatch_async Performance Issues", but have you actually looked into whether you see the same problem in a single-threaded case that does not use dispatch_async?
I haven't done *exactly* that one, but I have a lot of other test cases (see below).
>
> All I can suggest is that you examine the preprocessor output and even the generated assembly code with the aim of spotting any significant differences in the generated code in each case. It may well be that a function call is being generated as part of the iterator code, or something like that. Shark may help you pin down where the problem is, but you will probably need to have some level of appreciation of assembly code to be able to fully interpret the results for your optimized build.
I examined the assembly, and at first glance I couldn't find any hints. Both versions looked quite similar. My guess is that it is some synchronization primitive, like a spin lock.
After modifying the "directly implemented" and "C++ iterator" versions, both now run similarly if no future is used. If a future is used, for some reason the performance drops much more for the C++ iterator version than for the code that mimics its behavior.
But, considering my original goal, performance is a bit slow; that is, it is not faster than a very primitive implementation which uses one thread, allocates NSData buffers, and adds them to an NSMutableData object which is then processed (see the "Classic" bench). This is a bit disappointing. I guess the overhead for dispatch is still high. Here are some results of my benchmarks (note that in the single-threaded approaches the future has no effect, since there is none):
**** CFDataBuffers Benchmark ****
Using Future: No
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2
[SimpleProduceConsume1]: Elapsed time: 299.028ms
[SimpleProduceConsume2]: Elapsed time: 125.744ms
[Classic]: Elapsed time: 218.383ms
[ConcurrentProduceConsume1]: Elapsed time: 123.748ms
[ConcurrentProduceConsume2]: Elapsed time: 271.387ms
[ConcurrentProduceConsumeIter]: Elapsed time: 265.175ms
**** CFDataBuffers Benchmark ****
Using Future: Yes
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2
[SimpleProduceConsume1]: Elapsed time: 296.692ms
[SimpleProduceConsume2]: Elapsed time: 125.796ms
[Classic]: Elapsed time: 215.36ms
[ConcurrentProduceConsume1]: Elapsed time: 133.07ms
[ConcurrentProduceConsume2]: Elapsed time: 236.686ms
[ConcurrentProduceConsumeIter]: Elapsed time: 400.485ms
(As usual, take the numbers with a grain of salt!)
Increasing the buffer size results in increasingly better performance.
So, the big question: where is the time spent when activating a "future" (that is, using a result variable defined with the __block modifier)?
N is the number of buffers created and consumed.
C is the capacity of the concurrent buffers' list (unless otherwise stated in the description), that is, the maximum number of buffers it can hold at once. C has a very limited effect on performance in this test case.
The buffer size may have a severe impact on performance if it is too small.
Description:
SimpleProduceConsume1:
// Create a buffers instance with capacity N, then produce N buffers with
// BUFFER_SIZE bytes, fill them and when finished, consume and process them.
// Runs sequentially in one thread. Allocates all required buffers
// for the duration of the whole operation.
// The performance may suffer due to the massive allocations.
SimpleProduceConsume2:
// Create a buffers instance with capacity 1, then produce N times one buffer
// with size BUFFER_SIZE and fill it. Consume and process it immediately and
// release the buffer. This version allocates only one buffer per iteration.
// Runs in one thread.
// Performance should be fast compared to other approaches; however, since it
// uses the CFDataConstBuffers object it involves a certain overhead due to
// its thread-safe design (which is not used in this case).
Classic:
// Create one mutable NSData object. For N times, create and produce a buffer with
// size BUFFER_SIZE and fill it. Consume the buffer and append the content to
// the mutable NSData object. This may require reallocating the mutable buffer
// and copying the content. When finished, process the content of the
// mutable NSData object.
// This approach seems to impose a huge overhead - but it turns out to perform
// quite well, possibly due to internal optimizations.
// Runs in one thread.
ConcurrentProduceConsume1:
// Create a buffers instance with capacity C, then produce N buffers with
// BUFFER_SIZE bytes and fill the buffers. Concurrently consume the buffers.
// Runs concurrently on two threads. Straightforward and possibly fast
// implementation.
ConcurrentProduceConsume2:
// Mimics ConcurrentProduceConsumeIter using code similar to what the
// compiler produces.
// Create a buffers instance with capacity C, then produce N buffers with
// BUFFER_SIZE bytes and fill them. Concurrently consume the buffers and
// iterate over the consumed buffer.
// Runs concurrently on two threads.
ConcurrentProduceConsumeIter:
// Create a buffers instance with capacity C, then produce N buffers with
// BUFFER_SIZE bytes. Concurrently consume the buffers and process them by
// iterating over the buffers' content.
// Runs concurrently on two threads.
> Unfortunately without the crucial part of your source code there's not a lot anyone else can do to help you in that respect...
Since the final source will be open source anyway, I can provide it here if required.
Anyway, I strongly suspect there is something related to dispatch, blocks, or synchronization which affects the performance this heavily.
>
> Hope that helps a bit
> Jonny
Regards
Andreas
Source can be viewed at:
http://codeviewer.org/view/code:1c40
_______________________________________________
Cocoa-dev mailing list (email@hidden)