On Feb 9, 2014, at 12:19 AM, Gerriet M. Denkmann <gerr...@mdenkmann.de> wrote:
> The real app (which I am trying to optimise) has actually two loops: one is 
> counting, the other one is modifying. Which seems to be good news.
> 
> But I would really like to understand what I should do. Trial and error (or 
> blindly groping in the mist) is not really my preferred way of working.

Optimizing small loops like this is a black art. Very small effects become 
critically important, such as the alignment of your loop instructions or the 
associativity of that CPU's L1 cache. 

Avoiding cache line contention is the first priority for that code. The array 
of sums is inefficient: sums for different threads sit on the same cache line, 
and they are written on every pass through the loop. This is a classic case of 
false sharing. (The compiler is unlikely to optimize away those writes, 
because it is unlikely to recognize that the threads don't use each other's 
sums.) Writing to a cache line requires one thread to own it to the exclusion 
of the others, which creates a bottleneck where significant time is spent 
ping-ponging ownership of the line between threads. The code would likely be 
faster if each thread maintained its own sum in a local variable and wrote to 
the array of sums only at the end. That reduces cache line contention and also 
makes it more likely that the optimizer can keep the sum in a register, 
avoiding memory entirely.
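
A minimal sketch of that local-sum idea, assuming the task is counting 
occurrences of a target byte with the work split into one chunk per thread. 
The Job struct and worker() are illustrative names, not the original code:

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *bytes;   /* this thread's chunk */
    size_t         length;
    uint8_t        target;
    size_t         count;   /* result; written exactly once, at the end */
} Job;

static void *worker(void *arg)
{
    Job *job = arg;
    size_t local = 0;       /* stays in a register; touches no shared line */
    for (size_t i = 0; i < job->length; i++) {
        if (job->bytes[i] == job->target)
            local++;
    }
    job->count = local;     /* one shared write per thread, not per byte */
    return NULL;
}

With one final store per thread, the Job structs can share a cache line 
harmlessly; if threads had to publish intermediate results, padding each Job 
out to its own cache line would be the next step.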

Working one byte at a time is almost never the efficient approach. If the 
compiler's autovectorizer isn't improving this code for you, then you should 
look into writing architecture-specific vector code, either in assembly or 
with compiler intrinsics (C functions that map almost one-to-one onto vector 
instructions). Good vector code might be 2-8x faster than byte-at-a-time code.
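
For illustration, here is a rough sketch of byte counting with SSE2 
intrinsics, assuming an x86 target; on ARM the NEON equivalents in 
<arm_neon.h> would play the same role. The function name and the block 
structure are mine, not a reference implementation:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static size_t count_byte_sse2(const uint8_t *p, size_t n, uint8_t target)
{
    const __m128i needle = _mm_set1_epi8((char)target);
    __m128i totals = _mm_setzero_si128();    /* two 64-bit running sums */
    size_t i = 0;

    while (i + 16 <= n) {
        /* cmpeq yields 0xFF (-1) per matching byte; subtracting it
           increments a per-byte counter. A byte counter saturates at
           255, so flush to the 64-bit sums every 255 blocks at most. */
        __m128i acc = _mm_setzero_si128();
        size_t blocks = (n - i) / 16;
        if (blocks > 255) blocks = 255;
        for (size_t b = 0; b < blocks; b++, i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, needle));
        }
        /* horizontal add of the 16 byte counters into two 64-bit lanes */
        totals = _mm_add_epi64(totals,
                               _mm_sad_epu8(acc, _mm_setzero_si128()));
    }

    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, totals);
    size_t count = (size_t)(lanes[0] + lanes[1]);

    for (; i < n; i++)                       /* scalar tail */
        count += (p[i] == target);
    return count;
}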

Cache associativity can mean that some array split sizes are much worse than 
others. If you choose an unlucky size, each thread's working memory is on 
different cache lines, but those lines map to the same associativity sets and 
evict each other from the cache. Changing the work size to avoid these 
collisions can help.
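
One way to attack that, sketched under the assumption of 64-byte cache lines: 
make each thread's chunk an odd number of cache lines, so consecutive chunk 
start addresses don't all map to the same set when the buffer is a large power 
of two. The helper below is illustrative only, not from the original code:

#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* Pick a per-thread chunk size near total/nthreads that is an odd
   number of cache lines, staggering the chunk start addresses across
   the cache's sets. The last thread absorbs the remainder. */
static size_t staggered_chunk(size_t total, unsigned nthreads)
{
    size_t chunk = total / nthreads;
    chunk -= chunk % CACHE_LINE;             /* whole cache lines only */
    if ((chunk / CACHE_LINE) % 2 == 0)
        chunk += CACHE_LINE;                 /* force an odd line count */
    return chunk;
}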

Measuring the bottlenecks in this sort of code is difficult. The CPU can count 
some of the adverse events that matter here, such as branch mispredictions and 
memory cache misses. The Counters and Thread States instruments can record and 
present some of this information to you.


-- 
Greg Parker     gpar...@apple.com     Runtime Wrangler


