On Tue, Mar 03, 2009 at 03:44:35PM +0300, Yury Serdyuk wrote:
> So for N multiple of 512 there is very strong drop of performance.
> The question is - why and how to avoid it ?
>
> In fact, given effect is present for any platforms ( Intel Xeon,
> Pentium, AMD Athlon, IBM Power PC)
> and for gcc 4.1.2, 4.3.0.
> Moreover, that effect is present for Intel icc compiler also,
> but only till to -O2 option. For -O3, there is good smooth performance.
> Trying to turn on -ftree-vectorize do nothing:

Basically, at higher N's you are thrashing the cache. Computers tend to
prefetch for sequential access, but accessing one matrix by column does
not fit that prefetching pattern. Since caches have a fixed size, sooner
or later the whole matrix will not fit in the cache. If you have to go
out to main memory, the processor can stall for a long time while it
waits for the memory to be fetched.

It may be that the Intel compiler has better support for handling matrix
multiply.

The usual fix is to recode your multiply so that it is more cache
friendly. This is an active research topic, so Google or another search
engine is your friend. For instance, this was one of the first links I
found when looking for 'matrix multiple cache':

http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/cache/index.html

--
Michael Meissner, IBM
4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
meiss...@linux.vnet.ibm.com
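
P.S. To make "more cache friendly" concrete, here is a rough sketch of one common recoding, loop blocking (tiling), next to the naive version. It is not taken from the original program: the function names, the row-major layout, and the block size parameter bs are my assumptions for illustration, and the best bs depends on the machine. When N is a multiple of 512, the column walk through B has a large power-of-two stride, which likely makes those accesses compete for the same few cache sets; the blocked version avoids the column walk entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Naive i-j-k product C = A * B for n x n row-major matrices.
   The inner loop walks B down a column, i.e. with a stride of n doubles. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked (tiled) product: same result, but the work is split into
   bs x bs tiles so the data touched by the inner loops can stay in cache.
   bs is a hypothetical tuning knob; it does not have to divide n. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);

    /* C is accumulated into, so it must start at zero. */
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                /* Clamp tile edges for n not divisible by bs. */
                size_t i_end = ii + bs < n ? ii + bs : n;
                size_t k_end = kk + bs < n ? kk + bs : n;
                size_t j_end = jj + bs < n ? jj + bs : n;

                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* Unit-stride walk along a row of B and of C. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Both functions expect the caller to supply three n*n arrays of double. A reasonable experiment is to time both at N = 511, 512 and 513 and try a few values of bs (say 16 to 64) to see which one smooths out the drop on your hardware.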