The 2011 WWDC session "Blocks and Grand Central Dispatch in Practice" talks
about cache line size, which I believe is relevant here.

You can read my notes from that session here:

http://blog.yvs.eu.com/2013/07/blocks-and-grand-central-dispatch-in-practice/

Kevin

On 9 Feb 2014, at 08:53, Greg Parker <gpar...@apple.com> wrote:

> On Feb 9, 2014, at 12:19 AM, Gerriet M. Denkmann <gerr...@mdenkmann.de> wrote:
>> The real app (which I am trying to optimise) actually has two loops: one is 
>> counting, the other is modifying. That seems to be good news.
>> 
>> But I would really like to understand what I should do. Trial and error (or 
>> blindly groping in the mist) is not really my preferred way of working.
> 
> Optimizing small loops like this is a black art. Very small effects become 
> critically important, such as the alignment of your loop instructions or the 
> associativity of that CPU's L1 cache. 
> 
> Avoiding cache line contention is the first priority for that code. The array 
> of sums is inefficient. Sums for different threads are present on the same 
> cache line and they are written every time through the loop. (The compiler is 
> unlikely to be able to optimize away those writes because it is unlikely to 
> be able to recognize that the threads don't use each other's sums.) Writing 
> to a cache line requires ownership by one thread to the exclusion of the 
> others. This creates a bottleneck where significant time is spent 
> ping-ponging control of a cache line between multiple threads. The code would 
> likely be faster if each thread maintained its own sum in a local variable, 
> and wrote to the array of sums only at the end. That should reduce cache line 
> contention and also make it more likely that the compiler optimizer can keep 
> the sum in a register, avoiding memory entirely.
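> 
> A minimal sketch of that fix, assuming the counting is driven by 
> dispatch_apply over fixed-size chunks (`buffer`, `length`, `target`, and 
> `sums` are illustrative names, not the original code):
> 
>     #include <dispatch/dispatch.h>
>     #include <stdint.h>
>     #include <stddef.h>
> 
>     /* Each chunk accumulates into a local variable -- likely kept in a
>        register -- and touches the shared sums array exactly once. */
>     void count_bytes(const uint8_t *buffer, size_t length, uint8_t target,
>                      size_t *sums, size_t chunks)
>     {
>         size_t chunkSize = length / chunks;
>         dispatch_queue_t q =
>             dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
>         dispatch_apply(chunks, q, ^(size_t i) {
>             const uint8_t *p = buffer + i * chunkSize;
>             const uint8_t *end = (i == chunks - 1) ? buffer + length
>                                                    : p + chunkSize;
>             size_t localSum = 0;        /* thread-private; never shared */
>             while (p < end) {
>                 if (*p++ == target) localSum++;
>             }
>             sums[i] = localSum;         /* one write per chunk, not per byte */
>         });
>     }
> 
> If even those final writes contend, padding each slot of sums out to a 
> 64-byte cache line would put them on separate lines.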
> 
> Working one byte at a time is almost never the efficient thing to do. If the 
> compiler's autovectorizer isn't improving this code for you, then you should 
> look into writing architecture-specific vector code, either in assembly or 
> with compiler intrinsics, the C functions that map almost one-to-one onto 
> vector instructions. Good vector code might be 2-8x faster than 
> byte-at-a-time code.
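> 
> A minimal sketch of the intrinsics approach, assuming an Intel Mac with SSE2 
> and that the task is counting occurrences of one byte value (the unaligned 
> head and scalar tail are left out to keep it short):
> 
>     #include <emmintrin.h>   /* SSE2 intrinsics */
>     #include <stdint.h>
>     #include <stddef.h>
> 
>     /* Compares 16 bytes per iteration; assumes length is a multiple of 16. */
>     size_t count_bytes_sse2(const uint8_t *buffer, size_t length,
>                             uint8_t target)
>     {
>         __m128i needle = _mm_set1_epi8((char)target);
>         size_t count = 0;
>         for (size_t i = 0; i < length; i += 16) {
>             __m128i chunk = _mm_loadu_si128((const __m128i *)(buffer + i));
>             __m128i eq = _mm_cmpeq_epi8(chunk, needle); /* 0xFF where equal */
>             count += __builtin_popcount(_mm_movemask_epi8(eq));
>         }
>         return count;
>     }
> 
> The same idea carries over to NEON intrinsics on ARM.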
> 
> Cache associativity means that some array split sizes are much worse than 
> others. If you choose the wrong size, each thread's working memory lands on 
> different cache lines, but those lines all map to the same cache sets and 
> evict each other. Changing the work size to avoid such collisions can help.
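> 
> A worked example with assumed numbers: a 32 KB, 8-way L1 data cache with 
> 64-byte lines has 32768 / 64 / 8 = 64 sets, so addresses an exact multiple 
> of 64 * 64 = 4096 bytes apart compete for the same set:
> 
>     #include <stdio.h>
>     #include <stdint.h>
> 
>     /* Assumed geometry: 32 KB, 8-way, 64-byte lines => 64 sets. */
>     enum { LINE = 64, WAYS = 8, CACHE = 32 * 1024,
>            SETS = CACHE / LINE / WAYS };
> 
>     int main(void)
>     {
>         for (int t = 0; t < 4; t++) {
>             uintptr_t start = (uintptr_t)t * 4096;  /* power-of-two split */
>             printf("chunk %d -> set %lu\n", t,
>                    (unsigned long)((start / LINE) % SETS));
>         }
>         return 0;   /* all four chunks map to set 0: they alias */
>     }
> 
> Splitting at 4096 + 64 bytes instead staggers the chunks across the sets.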
> 
> Measuring the bottlenecks in this sort of code is difficult. The CPU can 
> count some of the adverse events that matter here, such as branch 
> mispredictions and memory cache misses, and the Counters and Thread States 
> instruments can record and present some of that information to you.
> 
> 
> -- 
> Greg Parker     gpar...@apple.com     Runtime Wrangler

