The 2011 WWDC session "Blocks and Grand Central Dispatch in Practice" talks about cache line size, which I believe is relevant here.
You can read my notes from that session here: http://blog.yvs.eu.com/2013/07/blocks-and-grand-central-dispatch-in-practice/

I've also appended a few rough sketches below Greg's message to make his suggestions concrete.

Kevin

On 9 Feb 2014, at 08:53, Greg Parker <gpar...@apple.com> wrote:

> On Feb 9, 2014, at 12:19 AM, Gerriet M. Denkmann <gerr...@mdenkmann.de> wrote:
>> The real app (which I am trying to optimise) actually has two loops: one is
>> counting, the other one is modifying. Which seems to be good news.
>>
>> But I would really like to understand what I should do. Trial and error (or
>> blindly groping in the mist) is not really my preferred way of working.
>
> Optimizing small loops like this is a black art. Very small effects become
> critically important, such as the alignment of your loop instructions or the
> associativity of that CPU's L1 cache.
>
> Avoiding cache line contention is the first priority for that code. The array
> of sums is inefficient. Sums for different threads are present on the same
> cache line and they are written every time through the loop. (The compiler is
> unlikely to be able to optimize away those writes because it is unlikely to
> be able to recognize that the threads don't use each other's sums.) Writing
> to a cache line requires ownership by one thread to the exclusion of the
> others. This creates a bottleneck where significant time is spent
> ping-ponging control of a cache line between multiple threads. The code would
> likely be faster if each thread maintained its own sum in a local variable,
> and wrote to the array of sums only at the end. That should reduce cache line
> contention and also make it more likely that the compiler optimizer can keep
> the sum in a register, avoiding memory entirely.
>
> Working one byte at a time is almost never the efficient thing to do. If the
> compiler's autovectorizer isn't improving this code for you then you should
> look into writing architecture-specific vector code, either in assembly or in
> C functions that look almost exactly like vector assembly. Good vector code
> might be 2-8x faster than byte-at-a-time code.
>
> Cache associativity can mean that there are some array split sizes that are
> much worse than others. If you choose the wrong size then each thread's
> working memory is on different cache lines, but those cache lines collide
> with each other in memory caches. Changing the work size to avoid collisions
> can help.
>
> Measuring the bottlenecks in this sort of code is difficult. The CPU can
> count some of the adverse events that this sort of code needs to care about,
> such as branch mis-prediction and memory cache misses. The Counters and
> Thread States instruments can record and present some of this information to
> you.
>
> --
> Greg Parker     gpar...@apple.com     Runtime Wrangler
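To make Greg's first point concrete, here is a minimal sketch of a counting loop split across threads with dispatch_apply. The function and variable names are my own invention, not Gerriet's actual code; the point is only that each block accumulates into a stack-local variable and writes the shared array exactly once, at the end.

    #include <dispatch/dispatch.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example: count occurrences of `target` in `buf`.
       Each work item keeps its running sum in a local variable (ideally
       a register) and touches the shared `sums` array exactly once, so
       the threads stop ping-ponging the cache line that holds `sums`. */
    static size_t CountBytesConcurrently(const uint8_t *buf, size_t len,
                                         uint8_t target, size_t chunks,
                                         size_t *sums /* `chunks` entries */)
    {
        size_t chunkLen = len / chunks;
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        dispatch_apply(chunks, q, ^(size_t i) {
            const uint8_t *p   = buf + i * chunkLen;
            const uint8_t *end = (i == chunks - 1) ? buf + len : p + chunkLen;
            size_t localSum = 0;
            while (p < end) {
                if (*p++ == target) localSum++;
            }
            sums[i] = localSum;   /* the only write to shared memory */
        });

        size_t total = 0;
        for (size_t i = 0; i < chunks; i++) total += sums[i];
        return total;
    }

With a single write per chunk, the optimizer also has a fair chance of keeping localSum in a register for the whole loop, as Greg describes.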
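If the per-thread slots really must be written during the loop, a common trick (my addition, not something Greg suggested) is to pad each slot out to a full cache line so that no two threads ever share one. I'm assuming 64-byte lines here, which is what recent Intel Macs report; check with `sysctl hw.cachelinesize` rather than taking my word for it.

    #include <stddef.h>

    /* Assumed 64-byte cache lines; verify with `sysctl hw.cachelinesize`. */
    #define kCacheLineSize 64

    typedef struct {
        size_t value;
        char   _pad[kCacheLineSize - sizeof(size_t)];
    } __attribute__((aligned(kCacheLineSize))) PaddedSum;

    /* With one PaddedSum per thread, writes to sums[i].value no longer
       share a cache line with writes to sums[j].value for i != j. */

This trades memory for the elimination of false sharing; the local-variable approach above is still preferable when it is possible.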
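And on the vector point: on the Intel Macs this thread presumably concerns, the "C functions that look almost exactly like vector assembly" are the intrinsics in <emmintrin.h> and friends. A rough SSE2 sketch of the counting loop (again with names of my own invention), comparing 16 bytes per iteration:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    static size_t CountBytesSSE2(const uint8_t *p, size_t len, uint8_t target)
    {
        const __m128i t = _mm_set1_epi8((char)target);
        size_t count = 0, i = 0;

        for (; i + 16 <= len; i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(p + i));
            __m128i eq = _mm_cmpeq_epi8(v, t);   /* 0xFF where bytes match */
            unsigned mask = (unsigned)_mm_movemask_epi8(eq); /* 1 bit/byte */
            count += (size_t)__builtin_popcount(mask);
        }
        for (; i < len; i++)                     /* scalar tail */
            if (p[i] == target) count++;

        return count;
    }

Whether this actually beats what the autovectorizer already produces is exactly the kind of thing to measure with the Counters instrument, per Greg's last paragraph.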