On Jun 28, 2011, at 3:53 PM, Jonathan Taylor wrote:

Hi Jonathan, 

Thank you very much for taking the time to look at it.

> Hi Andreas,
> 
> If I understand your post correctly, you are saying that you see a 
> performance drop of 3x when using an iterator in your inner loop as opposed 
> to using hand-written C code to do the iterating.

Yes, at least that was my initial assumption.

In the meantime, however, I found one (surprising) cause of the performance 
issue. After making the versions *more* equivalent, the issue became apparent.
I restructured the second version (the one using the C++ iterators) and will 
discuss it in more detail. The culprit is in the consumer part, as follows:

New restructured code:

        ...
#if defined (USE_FUTURE)
    __block size_t sum = 0;
    __block size_t total = 0;
#endif
    
    dispatch_async(queue,
     ^{
         CFDataConstBuffers_iterator<char> eof;
         CFDataConstBuffers_iterator<char> iter(*buffersPtr);
         
         size_t sum_ = 0;    // local accumulators, deliberately not __block
         size_t total_ = 0;
         
         while (iter != eof)
         {
             sum_ += *iter;
             ++iter;
             ++total_;
         }
         
#if defined (USE_FUTURE)
         sum = sum_;         // write the futures exactly once, at the end
         total = total_;
#endif
         semPtr->signal();
     });
     ...


Compared to the code provided in the previous mail, the differences are:

1) The C++ instances, that is the iterators, are defined locally within the 
block.

2) The "Future" (that is, the result of the operation) is conditionally 
compiled in or out, in order to test its impact. Here, the __block modifier is 
used for the "Future" variables "sum" and "total". When pointers are used 
within the block to access the outside variables instead, the performance does 
not differ, but using __block is probably more correct (see the sketch after 
this list).
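
For illustration, here is roughly what the pointer variant looks like, as a 
minimal, self-contained sketch. The queue and semaphore setup are simplified, 
and a dispatch semaphore stands in for the C++ semaphore wrapper used in the 
real code:

    #include <dispatch/dispatch.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        dispatch_queue_t queue = dispatch_queue_create("consumer", NULL);
        dispatch_semaphore_t sem = dispatch_semaphore_create(0);

        size_t sum = 0;
        size_t total = 0;
        size_t *sumPtr = &sum;      // the block captures the pointers by value
        size_t *totalPtr = &total;

        dispatch_async(queue, ^{
            size_t sum_ = 0;        // accumulate into plain locals ...
            size_t total_ = 0;
            for (int i = 0; i < 1000; ++i) { sum_ += i; ++total_; }
            *sumPtr = sum_;         // ... and write back once through the pointers
            *totalPtr = total_;
            dispatch_semaphore_signal(sem);
        });

        dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
        printf("sum = %zu, total = %zu\n", sum, total);
        dispatch_release(queue);
        dispatch_release(sem);
        return 0;
    }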


Note that I access the variables sum and total only once, at the end of the 
block. There is a reason for this, which I will explain below.

        
As mentioned, the conditional #if defined (USE_FUTURE) tests its impact on 
performance.

If USE_FUTURE is defined, the performance drops dramatically!
The same happens if there are pointer variables in the block which access 
variables defined outside of it.

The performance drops even more if I use the __block variables "sum" and 
"total" directly when incrementing, e.g.:

         while (iter != eof)
         {
             sum += *iter;
             ++iter;
             ++total;
         }

Even when I access the futures "sum" and "total" only once in the block, the 
performance penalty is significant.


> Unfortunately you say you haven't actually posted the code relating to the 
> iterator... but it seems to me that this is very likely to be the source of 
> your problems!

OK, I have appended the source code. :) 
It is three files, which is a lot for one mail.
But I don't think this is the cause of the issue.

> 
> Your title states "dispatch_async Performance Issues", but have you actually 
> looked into whether you see the same problem in a single-threaded case that 
> does not use dispatch_async?
I haven't done *exactly* this one, but I have a lot of other test cases (see 
below).
> 
> All I can suggest is that you examine the preprocessor output and even the 
> generated assembly code with the aim of spotting any significant differences 
> in the generated code in each case. It may well be that a function call is 
> being generated as part of the iterator code, or something like that. Shark 
> may help you pin down where the problem is, but you will probably need to 
> have some level of appreciation of assembly code to be able to fully 
> interpret the results for your optimized build. 

I examined the assembly, and at first glance I couldn't find any hints. Both 
versions looked quite similar. My guess is that it is some synchronization 
primitive, like a spin lock.

After modifying the "directly implemented" and "C++ iterator" versions, both 
now run similarly if no future is used. If a future is used, for some reason 
the performance drops much more for the C++ iterator version than for the code 
that mimics its behavior.

But, considering my original goal, performance is a bit slow; that is, it is 
not faster than a very primitive implementation which uses one thread, 
allocates NSData buffers, and appends them to an NSMutableData object which is 
then processed (see the "Classic" bench). This is a bit disappointing. I guess 
the overhead for dispatch is still high. Here are some results of my 
benchmarks (note that in the single-threaded approaches the future has no 
effect, since there is none):


**** CFDataBuffers Benchmark ****
Using  Future: No
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2
[SimpleProduceConsume1]:        Elapsed time: 299.028ms  
[SimpleProduceConsume2]:        Elapsed time: 125.744ms
[Classic]:                      Elapsed time: 218.383ms
[ConcurrentProduceConsume1]:    Elapsed time: 123.748ms
[ConcurrentProduceConsume2]:    Elapsed time: 271.387ms
[ConcurrentProduceConsumeIter]: Elapsed time: 265.175ms

**** CFDataBuffers Benchmark ****
Using  Future: Yes
Data size: 131072KB, Buffer size: 8192, N = 16384, C = 2
[SimpleProduceConsume1]:        Elapsed time: 296.692ms
[SimpleProduceConsume2]:        Elapsed time: 125.796ms
[Classic]:                      Elapsed time: 215.36ms
[ConcurrentProduceConsume1]:    Elapsed time: 133.07ms
[ConcurrentProduceConsume2]:    Elapsed time: 236.686ms
[ConcurrentProduceConsumeIter]: Elapsed time: 400.485ms

(As usual, take the numbers with a grain of salt!)

Increasing the buffer size results in increasingly better performance.

*
*  So, the big question: where is the time spent when activating a
*  "Future" (that is, using a result variable defined with the
*  __block modifier)?
*
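
My current guess at the mechanism, for what it is worth: according to the 
Clang Blocks ABI, a __block variable is not captured by value but wrapped in a 
"byref" structure, and every access goes through a forwarding pointer so that 
the stack copy and a potential heap copy stay in sync. Roughly (a simplified 
sketch; the field names are approximate):

    #include <stddef.h>

    // Sketch of what the compiler emits for "__block size_t sum",
    // simplified from the Clang Blocks ABI.
    struct Block_byref_sum {
        void                   *isa;
        struct Block_byref_sum *forwarding; // self, or the heap copy after Block_copy
        int                     flags;
        int                     size;
        size_t                  sum;        // the actual variable lives here
    };

    // Every read or write of "sum" then compiles to an access through
    // the forwarding pointer, roughly:
    static void add(struct Block_byref_sum *byref, size_t x)
    {
        byref->forwarding->sum += x;        // one extra indirection per access
    }

If that is right, it would explain why incrementing the __block variables 
directly in the loop is so much slower. It does not obviously explain the 
penalty for a single write at the end of the block, though.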




N is the number of buffers created and consumed.
C equals, if not stated otherwise in the description, the capacity of the 
concurrent buffer list, that is, the maximum number of buffers it can hold at 
once. C has a very limited effect on performance in this test case.
The buffer size may have a severe impact on performance if it is too small.

Description:

SimpleProduceConsume1:
// Create a buffers instance with capacity N, then produce N buffers with 
// BUFFER_SIZE bytes, fill them, and when finished, consume and process them.
// Runs sequentially in one thread. Allocates all required buffers
// for the duration of the whole operation.
// The performance may suffer due to the massive allocations.


SimpleProduceConsume2:
// Create a buffers instance with capacity 1, then produce, N times, one buffer
// with size BUFFER_SIZE and fill it. Consume and process it immediately and 
// release the buffer. This version allocates only one buffer per iteration. 
// Runs in one thread. 
// Performance should be fast compared to the other approaches; however, since
// it uses the CFDataConstBuffers object, it involves a certain overhead due to 
// its thread-safe design (which is not exercised in this case).


Classic:
// Create one mutable NSData object. N times, create and produce a buffer with 
// size BUFFER_SIZE and fill it. Consume the buffer and append its content to 
// the mutable NSData object. This may involve reallocating the mutable buffer 
// and copying the content. When finished, process the content of the mutable 
// NSData object. 
// This approach would seem to impose a huge overhead - but it turns out to 
// perform quite well, possibly due to internal optimizations.
// Runs in one thread. (A sketch of this approach follows the descriptions.)


ConcurrentProduceConsume1:
// Create a buffers instance with capacity C, then produce N buffers with
// BUFFER_SIZE bytes and fill the buffers. Concurrently consume the buffers.
// Runs concurrently on two threads. A straightforward and presumably fast 
// implementation.


ConcurrentProduceConsume2:
// Mimics ConcurrentProduceConsumeIter, using code similar to what the
// compiler produces.
// Create a buffers instance with capacity C, then produce N buffers with 
// BUFFER_SIZE bytes and fill them. Concurrently consume the buffers and 
// iterate over each consumed buffer.
// Runs concurrently on two threads. 


ConcurrentProduceConsumeIter:
// Create a buffers instance with capacity C, then produce N buffers with 
// BUFFER_SIZE bytes. Concurrently consume the buffers and process them by 
// iterating over the buffers' content.
// Runs concurrently on two threads. 
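
As referenced above, here is a minimal sketch of the "Classic" approach. This 
is a hypothetical, stripped-down stand-in (the real producer fills the buffer 
with actual data; here it is just zeroed):

    #import <Foundation/Foundation.h>

    enum { N = 16384, BUFFER_SIZE = 8192 };

    int main(void)
    {
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        char buffer[BUFFER_SIZE] = {0};   // stand-in for the real producer

        NSMutableData *data = [NSMutableData dataWithCapacity:BUFFER_SIZE];
        for (size_t i = 0; i < N; ++i) {
            // "produce" one buffer, then append it; NSMutableData
            // reallocates and grows as needed
            [data appendBytes:buffer length:BUFFER_SIZE];
        }

        // "process" the accumulated content in one pass
        size_t sum = 0;
        const char *bytes = (const char *)[data bytes];
        for (size_t i = 0; i < [data length]; ++i)
            sum += bytes[i];

        NSLog(@"sum = %zu", sum);
        [pool drain];
        return 0;
    }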



> Unfortunately without the crucial part of your source code there's not a lot 
> anyone else can do to help you in that respect...
Since the final source will be Open Source anyway, I can provide it here if 
required.

Anyway, I strongly suspect there is something related to dispatch, blocks, or 
synchronization which affects the performance this heavily.
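
To pin that down, the next step is probably a micro-benchmark which isolates 
just the future - no iterators, no buffers. Something along these lines 
(sketch only; times are raw Mach ticks, and a real run would need to guard 
against the compiler folding the loops):

    #include <dispatch/dispatch.h>
    #include <mach/mach_time.h>
    #include <stddef.h>
    #include <stdio.h>

    enum { N = 100000000 };

    int main(void)
    {
        dispatch_queue_t queue = dispatch_queue_create("bench", NULL);
        dispatch_semaphore_t sem = dispatch_semaphore_create(0);
        __block size_t sum = 0;

        // Variant 1: accumulate into a local, write the __block variable once.
        uint64_t t0 = mach_absolute_time();
        dispatch_async(queue, ^{
            size_t sum_ = 0;
            for (size_t i = 0; i < N; ++i) sum_ += i;
            sum = sum_;
            dispatch_semaphore_signal(sem);
        });
        dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
        uint64_t t1 = mach_absolute_time();

        // Variant 2: increment the __block variable on every pass.
        sum = 0;
        dispatch_async(queue, ^{
            for (size_t i = 0; i < N; ++i) sum += i;
            dispatch_semaphore_signal(sem);
        });
        dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
        uint64_t t2 = mach_absolute_time();

        printf("local accumulator: %llu ticks, __block accumulator: %llu ticks "
               "(sum = %zu)\n",
               (unsigned long long)(t1 - t0),
               (unsigned long long)(t2 - t1),
               sum);
        dispatch_release(queue);
        dispatch_release(sem);
        return 0;
    }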


> 
> Hope that helps a bit
> Jonny


Regards
Andreas


Source can be viewed at:

http://codeviewer.org/view/code:1c40
