On Tue, Mar 03, 2009 at 03:44:35PM +0300, Yury Serdyuk wrote:
> So for N a multiple of 512 there is a very strong drop in performance.
> The question is: why, and how can it be avoided?
> 
> In fact, this effect is present on every platform (Intel Xeon,
> Pentium, AMD Athlon, IBM PowerPC)
> and for gcc 4.1.2 and 4.3.0.
> Moreover, the effect is also present with the Intel icc compiler,
> but only up to the -O2 option. With -O3, performance is good and smooth.
> Turning on -ftree-vectorize does nothing:

Basically, at higher values of N you are thrashing the cache.  Computers tend
to prefetch for sequential access, but accessing one matrix by column does not
fit that pattern.  Since caches have a fixed size, sooner or later the whole
matrix will not fit in the cache.  If you have to go out to main memory, the
processor can stall for a long time while it waits for the memory to be
fetched.  It may be that the Intel compiler has better support for handling
matrix multiply.
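
To make the column-access problem concrete, here is a rough sketch.  It is not
your code (which I have not seen); the names matmul_ijk and matmul_ikj, the
row-major layout, and the use of double are all my assumptions.  In the textbook
i-j-k loop the inner loop walks down a column of b with a stride of n elements.
Swapping the two inner loops makes every inner-loop access sequential:

```c
#include <assert.h>
#include <stddef.h>

/* Textbook order: the inner loop reads b[k*n + j] with a stride of n
   elements, i.e. it walks down a column of a row-major matrix. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Interchanged order: the inner loop now walks one row of b and one row
   of c sequentially, which suits sequential prefetching. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;                       /* c is accumulated into below */
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];    /* reused across the whole j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions compute the same product; only the memory access order differs.
Whether the second is faster on your machines is something you would have to
measure.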

The usual way is to recode your multiply so that it is more cache friendly.
This is an active research topic, so google or another search engine is
your friend.  For instance, this was one of the first links I found when
searching for 'matrix multiple cache':
http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/cache/index.html
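
One common form of such recoding is blocking (tiling).  The following is only
a minimal sketch under my own assumptions: row-major double matrices, a
hypothetical function name matmul_blocked, and a tile edge bs that you would
have to tune so that the tiles being worked on fit in cache.  It is not taken
from the page above.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* c = a * b for n x n row-major matrices, processed in bs x bs tiles.
   bs is an illustrative tuning parameter, not a recommended value. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(n > 0 && bs > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;                          /* c is accumulated into below */

    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_sz(ii + bs, n);
                size_t k_end = min_sz(kk + bs, n);
                size_t j_end = min_sz(jj + bs, n);
                /* Multiply one tile of a by one tile of b. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

A call such as matmul_blocked(n, 32, a, b, c) would use 32 x 32 tiles; the
value 32 is only a starting point for experiments.  Power-of-two row lengths
such as 512 also tend to map a column's elements onto the same cache sets, so
padding each row by a few elements is another workaround people use.  I have
not measured either on your test case.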

-- 
Michael Meissner, IBM
4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
meiss...@linux.vnet.ibm.com
