On 13/11/15 at 19:32, Joel Rees wrote:
Do you use the -S option to get assembly language output to look at?
Yes.
Yeah, SIMD instructions can also be used as a rather expensive
substitute for pure DMA. Expensive because they are used for more than
pure data move.
What do you mean by "pure DMA" in this context? If you mean "direct
memory access", that is a concept that applies to peripherals that
access memory directly, not to the CPU.
Those computations aren't as simple as you might think. Quite a bit of
branching, and the instruction cache having to restart, etc. It's
definitely not SIMD, unless your SIMD has table lookup that can operate
independently in each data stream.
The computation does not require any branching (looping over the array
is not counted as part of the computation). Conditional expressions are
evaluated using masks. I have already implemented it that way,
processing 48 input bits at a time that expand to 64 output bits, using
64-bit integers. It's like SSE or AVX, but the distinction between
vector elements is done completely in software; as far as the hardware
is concerned, it's 64-bit logic and arithmetic. This implementation is
like a "prototype" that I made so that I could actually evaluate how
complex it is. It is simpler than I thought; with SSE or AVX it will be
simpler still, since they have the separation between vector elements
built in and have instructions dedicated to making masks and shuffling
data.
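To make "conditional expressions are evaluated using masks" concrete:
the thread never names the transformation, but the 6-bit-to-byte
expansion matches a base64-style encoding, so the alphabet below is an
assumption used only for illustration. A branchless scalar sketch of
translating one 6-bit value could look like this:

```c
/* Branchless translation of one 6-bit value (0..63) to an ASCII byte,
 * assuming a base64-style alphabet (A-Z, a-z, 0-9, '+', '/').
 * Each comparison yields an all-ones or all-zeros mask from the sign
 * bit; this assumes arithmetic right shift of negative ints, as
 * provided by all mainstream compilers.  No branches are taken. */
static unsigned char encode6(int v)
{
    int off = 'A';                 /* offset for v in 0..25          */
    off += ((25 - v) >> 31) & 6;   /* net 'a' - 26 for v >= 26       */
    off -= ((51 - v) >> 31) & 75;  /* net '0' - 52 for v >= 52       */
    off -= ((61 - v) >> 31) & 15;  /* net '+' - 62 for v >= 62       */
    off += ((62 - v) >> 31) & 3;   /* net '/' - 63 for v == 63       */
    return (unsigned char)(v + off);
}
```

Each range test becomes a mask that selects an offset adjustment, so
all 64 inputs flow through the same instruction sequence.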
Also, SSE can process 96 bits of input that expand to 128 bits of
output at a time (more with the wider AVX versions). If it takes 16
instructions to process one such block (and I predict that it will take
fewer), that is only 1 instruction per byte of output, compared to a
lookup table implementation that has a lower bound of 1 logical
operation (a shift) AND one 1-byte memory access per byte of output.
This provides a very rough comparison of complexity (in terms of number
of instructions) and performance, to address your concern about how
complex the SIMD implementation is.
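The 64-bit "prototype" described above can be sketched as follows. The
lane layout and the base64-style offsets are assumptions for
illustration; the point is the technique itself: the separation between
vector elements is done in software, using only plain 64-bit integer
operations.

```c
#include <stdint.h>

/* Spread bits 47..0 of x into the eight byte lanes of a 64-bit word,
 * 6 bits per lane, most significant group first.  A real
 * implementation would use fixed shift/mask steps; a loop keeps the
 * sketch short. */
static uint64_t spread6(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= ((x >> (42 - 6 * i)) & 0x3F) << (56 - 8 * i);
    return r;
}

/* Per-lane mask: 0xFF where a lane value is greater than k (k < 64).
 * Adding 127 - k to each lane makes bit 7 the comparison result;
 * lanes cannot carry into each other because values stay below 64. */
static uint64_t gt_mask(uint64_t lanes, uint64_t k)
{
    uint64_t hi = (lanes + (127 - k) * 0x0101010101010101ULL)
                  & 0x8080808080808080ULL;
    return (hi - (hi >> 7)) | hi;  /* widen bit 7 to the whole lane */
}

/* Translate eight 6-bit lane values at once, assuming a base64-style
 * alphabet; same mask-selected offsets as the scalar method.  All
 * intermediate lane values stay in 0..255, so no carries or borrows
 * cross lane boundaries. */
static uint64_t encode_lanes(uint64_t v)
{
    const uint64_t ones = 0x0101010101010101ULL;
    uint64_t r = v + 65 * ones;          /* 'A' + v              */
    r += gt_mask(v, 25) & (6 * ones);    /* +6  for lanes > 25   */
    r -= gt_mask(v, 51) & (75 * ones);   /* -75 for lanes > 51   */
    r -= gt_mask(v, 61) & (15 * ones);   /* -15 for lanes > 61   */
    r += gt_mask(v, 62) & (3 * ones);    /* +3  for lanes > 62   */
    return r;
}
```

With SSE or AVX the spread step becomes a shuffle and the masks come
from dedicated compare instructions, which is why the vector version
should be simpler still.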
As for the "instruction cache having to restart": I have never read
that in any optimization or ISA manual for the case of SSE or AVX; that
seems to be a misconception on your part. Can you point to a source?
[3]: A lookup table implementation accesses the input data *and* the
lookup table;
That's a really small table. Thirty years ago, it would have caused
cache thrashing. Not so much now.
It is not about cache size but about the required cache throughput.
Using a lookup table means more reads than an arithmetic implementation
(like I said: it requires reading the input data *and* the table), and
data has to be read and written in very small blocks (1 byte for the
lookup table, 6 bytes at most for the input data), which is generally
slower than using bigger blocks.
replacing the lookup table with SIMD arithmetic reduces the demand on
memory (including cache) throughput.
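For comparison, this is a minimal sketch of the table-driven approach
being discussed (the table contents are an assumed base64-style
alphabet, and the 24-bit group size is an assumption for illustration).
It shows the per-byte traffic in question: one shift/mask plus one
byte-sized table load for every output byte, on top of loading the
input bits themselves.

```c
#include <stdint.h>

/* Assumed base64-style alphabet, indexed by 6-bit value. */
static const char table[64] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789+/";

/* Encode one 24-bit group: 3 input bytes -> 4 output bytes.
 * Each output byte costs a shift, a mask, and a 1-byte table read. */
static void encode_group(const uint8_t in[3], char out[4])
{
    uint32_t bits = (uint32_t)in[0] << 16
                  | (uint32_t)in[1] << 8
                  | in[2];
    out[0] = table[(bits >> 18) & 0x3F];
    out[1] = table[(bits >> 12) & 0x3F];
    out[2] = table[(bits >>  6) & 0x3F];
    out[3] = table[ bits        & 0x3F];
}
```

The arithmetic version replaces the four table reads with register-only
operations, which is where the reduction in cache throughput comes
from.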
Does the AVX have that kind of table access? If you have a url for the
relevant specs, I'd be interested.
AVX has gather support. You can look up the details in Intel's or AMD's
ISA manuals. However, _note that I am *not* talking about accessing a
table with SIMD instructions._ I am completely replacing the lookup
table with arithmetic done using SIMD instructions.