On Tuesday 07 July 2009 02:08:57 Bradbev wrote:
> On Jul 6, 4:30 pm, fft1976 <fft1...@gmail.com> wrote:
> > On Jul 5, 11:42 pm, Bradbev <brad.beveri...@gmail.com> wrote:
> > > more to modern x86 chips.  After you have the best algorithm for the
> > > job, you very quickly find that going fast is entirely bound by memory
> > > speed (actually latency) - cache misses are the enemy.
> >
> > IME (outside JVM), this depends strongly on the kind of problem you
> > are solving as well as your implementation (you need to know how to
> > cache-optimize). One can easily think of problems that would fit
> > entirely in cache, but take an enormous amount of time.
>
> What sort of problems did you have in mind?  Anything that I can think
> of quickly spills the cache.
There are many examples in scientific computing where many small problems are 
attacked instead of one large problem. Practical use of FFTs falls into this 
category: most users perform many transforms of no more than 1,024 elements 
rather than fewer, longer transforms. Time-frequency analysis usually takes n 
samples and produces an n×n grid over time and frequency representing the 
signal, where each frequency is computed from a separate FFT. So you can get 
away with naively distributing the FFTs across cores, with no regard for 
cache locality, and still get very good performance.
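As a rough sketch of that naive distribution (in Python/NumPy rather than F#, with illustrative sizes and a hypothetical worker count), each 1,024-point transform is independent and fits in cache, so a plain thread pool suffices:

```python
# Sketch: many independent small FFTs distributed across worker threads.
# Python/NumPy stand in for F#/TPL here; sizes and worker count are
# illustrative assumptions, not anything from the original post.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def small_ffts_parallel(signals, workers=4):
    """Transform each small signal independently; each FFT fits
    comfortably in cache, so naive distribution works well."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(np.fft.fft, signals))

signals = [np.random.rand(1024) for _ in range(64)]
parallel = small_ffts_parallel(signals)
serial = [np.fft.fft(s) for s in signals]
assert all(np.allclose(p, s) for p, s in zip(parallel, serial))
```

The point is only that no blocking or affinity tricks are needed when every subproblem is cache-resident.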

This is somewhat reflected in the SciMark2 benchmark, which has small and 
large variants. Most people are interested in the in-cache "small" variant 
because it is more practically relevant. Cache effects are still visible in 
some of the subtasks even in the small case, but they are much weaker.

I am in the process of reimplementing stock numerical algorithms (Cholesky, 
LU, QR, eigenvalues) for our F# for Numerics product, written entirely in F# 
and parallelized using the TPL. Surprisingly, some of my F# code is 3x faster 
than the equivalent MATLAB on small problems (e.g. 100x100 matrices), even 
though MATLAB calls directly into vendor-tuned code. For larger problems 
(e.g. 1000x1000 matrices), MATLAB is usually 10x faster because its code was 
carefully optimized for cache effects.
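The kind of cache optimization those vendor libraries do can be illustrated, very crudely, by blocking: working on tiles that fit in cache rather than streaming whole rows through it. A hedged sketch in Python/NumPy (the 32x32 tile size is an arbitrary, illustrative choice, and real BLAS implementations are far more sophisticated):

```python
# Crude illustration of cache blocking (tiling) for matrix multiply.
# Vendor BLAS kernels do this far more carefully; the tile size here
# is an arbitrary, illustrative assumption.
import numpy as np

def blocked_matmul(A, B, tile=32):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # Each update touches only O(tile^2) data, which can
                # stay resident in cache across the inner k loop.
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

For in-cache problem sizes the blocking buys nothing, which is consistent with straightforward code being competitive on 100x100 matrices but not on 1000x1000 ones.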

I would actually say that inter-core cache effects (false sharing) are more 
important than conventional cache locality today, because you get massive 
performance degradation if you cock it up, and it is not at all obvious when 
that might occur because it depends upon things like where the allocator 
places your data. For example, if you have cores mutating global counters 
then you must make sure the counters are spaced far enough apart in memory 
that no two share a cache line.
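To make the spacing concrete: with 64-byte cache lines (a typical but platform-dependent figure), 8-byte counters must sit at least a full line apart. A sketch of the layout arithmetic in Python/NumPy (line size and core count are assumptions; in F#/.NET you would pad the counter type with dummy fields instead):

```python
# Sketch of padding per-core counters so no two share a cache line.
# Assumes 64-byte lines and 8 cores; both are illustrative assumptions,
# and the real line size is platform-dependent.
import numpy as np

LINE = 64              # assumed cache-line size in bytes
CORES = 8
PER_LINE = LINE // 8   # how many 8-byte int64 slots fill one line

# Back all counters with one buffer, but use only the first slot of
# each cache-line-sized chunk, leaving the rest as padding.
buf = np.zeros(CORES * PER_LINE, dtype=np.int64)
counters = buf[::PER_LINE]   # one counter per cache line

# Adjacent counters are now a full line apart, so writers on
# different cores never invalidate each other's lines.
assert counters.strides[0] == LINE
assert len(counters) == CORES
```

The same idea in a GC'd runtime is fragile precisely because the allocator, not you, decides where objects land, which is why the degradation is so hard to predict.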

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---