On Tuesday 07 July 2009 02:08:57 Bradbev wrote:
> On Jul 6, 4:30 pm, fft1976 <fft1...@gmail.com> wrote:
> > On Jul 5, 11:42 pm, Bradbev <brad.beveri...@gmail.com> wrote:
> > > more to modern x86 chips. After you have the best algorithm for the
> > > job, you very quickly find that going fast is entirely bound by memory
> > > speed (actually latency) - cache misses are the enemy.
> >
> > IME (outside JVM), this depends strongly on the kind of problem you
> > are solving as well as your implementation (you need to know how to
> > cache-optimize). One can easily think of problems that would fit
> > entirely in cache, but take an enormous amount of time.
>
> What sort of problems did you have in mind? Anything that I can think
> of quickly spills the cache.
There are many examples in scientific computing where many small problems are attacked instead of one large problem. Practical use of FFTs falls into this category: most users perform many short transforms (often no more than 1,024 elements each) rather than a few long ones. Time-frequency analysis, for example, usually takes n samples and produces an n×n grid over time and frequency representing the signal, where each column of the grid is computed by a separate FFT. So you can get away with naive distribution of FFTs across cores, with no regard for cache locality, and still get very good performance because each transform fits in cache (see the first sketch at the end of this post).

This is somewhat reflected in the SciMark2 benchmark, which has small and large variants. Most people are interested in the in-cache small variant because it is more practically relevant. Cache effects are still visible in some of the subtasks even in the "small" case, but they are much weaker.

I am in the process of reimplementing stock numerical algorithms (Cholesky, LU, QR, eigenvalue decomposition) for our F# for Numerics product, written entirely in F# and parallelized using the Task Parallel Library (TPL). Surprisingly, some of my F# code is 3x faster than the equivalent MATLAB for small problems (e.g. 100x100 matrices), even though MATLAB calls directly into vendor-tuned code. For larger problems (e.g. 1000x1000 matrices), MATLAB is usually 10x faster because its libraries were carefully optimized for cache behaviour; the second sketch below shows the kind of loop transformation involved.

I would actually say that intercore cache effects, false sharing in particular, are more important than conventional single-core cache locality today, because you get massive performance degradation if you cock it up, and it is not at all obvious when that might occur: it depends upon things like where the allocator happens to place your data. For example, if you have cores mutating global counters then you must make sure the counters are spaced out far enough in memory that no two share a cache line (third sketch below).
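To make the first point concrete, here is a minimal F# sketch of the kind of naive distribution I mean. This is not our product code: it assumes .NET 4's System.Numerics.Complex and the TPL, and it uses a toy O(n^2) DFT in place of a real FFT, because the parallel structure (one independent transform per window) is the point, not the transform itself:

    open System
    open System.Numerics
    open System.Threading.Tasks

    // Toy O(n^2) DFT standing in for a real FFT: each output bin is a
    // sum over all n input samples.
    let dft (xs : Complex[]) =
        let n = xs.Length
        Array.init n (fun k ->
            let mutable acc = Complex.Zero
            for j in 0 .. n - 1 do
                let angle = -2.0 * Math.PI * float k * float j / float n
                acc <- acc + xs.[j] * Complex(cos angle, sin angle)
            acc)

    // One independent transform per window, distributed naively across
    // cores. Each window fits in cache, so locality between windows is
    // irrelevant and the speedup is close to linear.
    let spectrogram (signal : Complex[]) windowSize =
        let windows = signal.Length / windowSize
        let grid : Complex[][] = Array.zeroCreate windows
        Parallel.For(0, windows, fun w ->
            let lo = w * windowSize
            grid.[w] <- dft signal.[lo .. lo + windowSize - 1])
        |> ignore
        grid

Each call to dft reads and writes only its own window, so the cores never contend for the same data no matter how the scheduler places the work.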
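The second point, why the vendor-tuned code behind MATLAB pulls ahead on 1000x1000 matrices, comes down to transformations like loop blocking. The following is only an illustration (real BLAS implementations do far more than this), comparing the textbook triple loop with a tiled version of the same O(n^3) multiply:

    // Textbook multiply: the inner loop walks down a column of b, a
    // large stride in memory, so on big matrices almost every access
    // to b misses the cache.
    let mulNaive (a : float[,]) (b : float[,]) =
        let n = Array2D.length1 a
        let c = Array2D.zeroCreate n n
        for i in 0 .. n - 1 do
            for j in 0 .. n - 1 do
                let mutable sum = 0.0
                for k in 0 .. n - 1 do
                    sum <- sum + a.[i, k] * b.[k, j]
                c.[i, j] <- sum
        c

    // Same arithmetic, tiled so that three bs x bs tiles of floats fit
    // in L1 and each tile is reused while it is still resident.
    let mulBlocked (a : float[,]) (b : float[,]) =
        let n = Array2D.length1 a
        let bs = 32
        let c = Array2D.zeroCreate n n
        for ii in 0 .. bs .. n - 1 do
            for kk in 0 .. bs .. n - 1 do
                for jj in 0 .. bs .. n - 1 do
                    for i in ii .. min (ii + bs) n - 1 do
                        for k in kk .. min (kk + bs) n - 1 do
                            let aik = a.[i, k]
                            for j in jj .. min (jj + bs) n - 1 do
                                c.[i, j] <- c.[i, j] + aik * b.[k, j]
        c

On 100x100 matrices the two perform similarly because everything is already resident; once the matrices spill the cache, the blocked version is typically several times faster.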
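Finally, spacing the counters out. One way to do it on .NET is an explicit struct layout that pads each counter to a whole cache line. The 64-byte figure is an assumption on my part (it is the line size on current x86 chips, but nothing in .NET guarantees what your target uses):

    open System
    open System.Runtime.InteropServices
    open System.Threading.Tasks

    // One counter per core, padded to 64 bytes so that no two counters
    // share a cache line. Without the padding, adjacent int64s sit on
    // the same line and every increment on one core invalidates that
    // line in every other core's cache (false sharing).
    [<Struct; StructLayout(LayoutKind.Explicit, Size = 64)>]
    type PaddedCounter =
        [<FieldOffset(0)>]
        val mutable Count : int64

    let counters = Array.zeroCreate<PaddedCounter> Environment.ProcessorCount

    Parallel.For(0, Environment.ProcessorCount, fun core ->
        // Each core touches only its own counter, so no atomics are
        // needed; the padding alone is what makes this scale.
        for _ in 1 .. 10000000 do
            counters.[core].Count <- counters.[core].Count + 1L)
    |> ignore

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e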