to...@tuxteam.de writes:
I see. But a soft emulation won't give you an idea of performance
anyway? Just thinking about the whole mess from caching down to
instruction set (all of which the emulator has wildly different
timings for)... I'd guess that the single/multi-thread issue is
just a ripple in a sea of uncertainty.
I think expecting just a guess for the timings from an emulator
(at least at this level) is too much. You'd be better off with
your back-of-the-envelope calculations (and then testing, once you
get your hands on "real" hardware).
I agree. Thanks for pointing out this reasoning.
monn...@iro.umontreal.ca writes:
I think the question was: what makes you think AVX will improve
the performance of *your* code? Base64 encoding/decoding should be
completely bandwidth-constrained, so it seems very unlikely that AVX
could make much of a difference.
Maybe it's bandwidth-constrained; I can't tell beforehand (and I don't
think you can either); I could only say that with some certainty after
doing tests.
I did some limited testing, but not enough yet. Depending on the testing
method and the specific Base64 implementation, memcpy is significantly
faster than a typical in-memory lookup-table implementation of Base64,
indicating that computation plays a non-negligible role in performance
(as opposed to the task being purely memory-constrained).
Answering your question: What makes me think that AVX, SSE, or similar
SIMD instruction sets will improve the performance of my code is:
[1] SIMD instructions are more efficient for copying memory because they
have less dispatch overhead, since they copy in bigger blocks. memcpy
implementations usually take advantage of that; so there would be a
benefit even in the case that the problem is bandwidth-constrained.
[2] Although Base64 is usually implemented with a lookup table, the
encoding can be performed by relatively simple arithmetical
computations, because the mapping can be described by bit expansion (6
bits to 8) and mapping a few contiguous input ranges to output ranges.
For example: 0 to 25 are mapped to 'A' (ASCII 65) through 'Z' (ASCII 90).
[3] A lookup-table implementation accesses the input data *and* the
lookup table; replacing the lookup table with SIMD arithmetic reduces
the demand on memory (including cache) throughput.
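To illustrate points [2] and [3], here is a scalar sketch of the
arithmetic mapping (my own illustrative code; b64_char is a hypothetical
helper, not a claim about any existing implementation). Each comparison
evaluates to 0 or 1, so the character is computed from the 6-bit value by
a handful of compares and adds, with no table and no branches; this is
exactly the shape of computation that SIMD compare-and-add instructions
can apply to 16 or 32 bytes per instruction:

```c
/* Map a 6-bit value (0..63) to its Base64 character arithmetically:
   0..25 -> 'A'..'Z', 26..51 -> 'a'..'z', 52..61 -> '0'..'9',
   62 -> '+', 63 -> '/'.  Each (v > k) comparison yields 0 or 1,
   so the range offsets accumulate without a lookup table. */
static char b64_char(unsigned v)
{
    int c = (int)v + 'A';              /* base range: 0..25 -> 'A'..'Z' */
    c += (v > 25) * ('a' - 'Z' - 1);   /* shift 26..51 into 'a'..'z'    */
    c += (v > 51) * ('0' - 'z' - 1);   /* shift 52..61 into '0'..'9'    */
    c += (v > 61) * ('+' - '9' - 1);   /* 62 -> '+'                     */
    c += (v > 62) * ('/' - '+' - 1);   /* 63 -> '/'                     */
    return (char)c;
}
```

A vectorized version would do the same compares and adds on whole
registers at once, touching memory only for the input and output buffers.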
I do not claim to be certain that this will improve performance, but
there is a very good possibility that it does, and I will know (in my
particular case, for my particular CPU) after completing my implementation.
Regards.