On Fri, Sep 23, 2011 at 11:16 AM, Mark Knecht <markkne...@gmail.com> wrote:
> On Fri, Sep 23, 2011 at 6:49 AM, Michael Mol <mike...@gmail.com> wrote:
> While I'm not a programmer at all I have been playing with some CUDA
> programming this year. The couple of comments below are based around
> that GPU framework and might differ for others.
>
> 1) I don't think the GPU latencies are much different than CPU
> latencies. A lot of it can be done with DMA so that the CPU is hardly
> involved once the pointers are set up. Of course it depends on the
> system but the GPU is pretty close to the action so it should be quite
> fast getting started.
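
For concreteness, the kind of round trip in question looks something
like the toy CUDA program below. It's purely illustrative (made-up
kernel, no error checking); the point is just that each cudaMemcpy is a
separate trip across the bus between system RAM and the card.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy kernel, name made up: just touch every element so the GPU has
   something to do between the two copies. */
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(int);

    int *host = (int *)malloc(bytes);
    for (int i = 0; i < n; i++)
        host[i] = i;

    int *dev;
    cudaMalloc((void **)&dev, bytes);

    /* Host -> device copy, kernel launch, device -> host copy.  The
       two copies are the CPU<->RAM<->GPU round trip; keep the data
       resident on the device and they go away. */
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    printf("host[0] = %d\n", host[0]);
    cudaFree(dev);
    free(host);
    return 0;
}
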
As long as stuff is done wholly in the GPU, the kind of latency I was
worried about (GPU<->system RAM<->CPU) isn't a problem. The problem is
going to be anything that involves data being passed back and forth, or
decisions needing to be made by the CPU. I concur with James that
CPU+GPU parts will help a great deal in that regard.

> 2) The big deal with GPUs is that they really pay off when you need to
> do a lot of the same calculations on different data in parallel. A
> book I read + some online stuff suggested they didn't pay off speed
> wise until you were doing at least 100 operations in parallel.
>
> 3) You do have to get the data into the GPU so for things that used
> fixed data blocks, like shading graphical elements, that data can be
> loaded once and reused over and over. That can be very fast. In my
> case it's financial data getting evaluated 1000 ways so that's
> effective. For data like a packet I don't know how many ways there are
> to evaluate that so I cannot suggest what the value would be.

Yeah, that's the problem. Cache loses its utility the less often you
revisit the same pieces of data. When they're talking about multiple
gigabits per second of throughput, cache won't be much good for more
than prefetches.

>
> None the less it's an interesting idea and certainly offloads computer
> cycles that might be better used for other things.

Earlier this year, I experimented a little bit with how one could
implement a Turing-complete language in a branchless style, like on
GPGPUs*. I figure it's doable, but you waste cores and memory on
discarded results. (Similar to when CPUs mispredict branches, but
worse.) There's a rough sketch of what I mean below, after my sig.

* OK, they're not branchless, but branches kill performance; I recall
my reading of the CUDA manual indicating that code has to be brought
back in step after a branch before any of the results are available.
But that was about two years ago when I read it.

>
> My NVidia 465GTX has 352 CUDA cores while the GS8200 has only 8 so
> there can be a huge difference based on what GPU you have available.

--
:wq
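
The promised sketch. Purely illustrative CUDA, not from my actual
experiment; the kernels and names are made up. The first kernel takes a
data-dependent branch, the second computes both candidate results and
throws one away, which is the wasted work I'm talking about.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Divergent version: threads in the same warp that take different
   sides of the if get serialized until they fall back in step. */
__global__ void branchy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;        /* one path       */
    else
        out[i] = in[i] * in[i];       /* the other path */
}

/* Compute-both version: every thread (bounds guard aside) does the
   work for both cases and keeps one; the discarded result is the
   wasted cores and memory. */
__global__ void compute_both(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    float a = in[i] * 2.0f;           /* result for the > 0 case  */
    float b = in[i] * in[i];          /* result for the <= 0 case */
    out[i] = (in[i] > 0.0f) ? a : b;  /* select one, discard one  */
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++)
        h[i] = (i % 2) ? 1.0f : -1.0f;

    float *din, *dout;
    cudaMalloc((void **)&din, bytes);
    cudaMalloc((void **)&dout, bytes);
    cudaMemcpy(din, h, bytes, cudaMemcpyHostToDevice);

    /* Both kernels produce the same answers; they only differ in how
       the per-thread decision is handled. */
    branchy<<<(n + 255) / 256, 256>>>(din, dout, n);
    compute_both<<<(n + 255) / 256, 256>>>(din, dout, n);

    cudaMemcpy(h, dout, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %g, out[1] = %g\n", h[0], h[1]);

    cudaFree(din);
    cudaFree(dout);
    free(h);
    return 0;
}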