On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
> Also huffman decoder optimizations (which are C code, not SIMD) in
> libjpeg-turbo seem to be providing only some barely measurable
> improvement on ARM, while huffman speedup is clearly more impressive
> on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a
> result.

In general, the Huffman codec improvements produce a greater speedup on
64-bit vs. 32-bit and a greater speedup when compressing vs.
decompressing.  So, whereas libjpeg-turbo's Huffman codec realizes about
a 25-50% improvement vs. the libjpeg Huffman codec when doing
compression using 64-bit code, it only realizes a few percent speedup
vs. libjpeg when doing decompression using 32-bit code.  The Huffman
algorithm uses a single register as a bit bucket, and the fewer times it
has to shift in new bits to that register, the faster it is.  That's why
it's so much faster on 64-bit vs. 32-bit.

The Huffman codec is probably the single biggest piece of low-hanging
fruit in the entire code base, since it represents something like 40-50%
of total execution time in many cases.  I've spent hundreds of hours
looking at it, and the basic problem with the 32-bit code seems to be
register exhaustion.  After trying many different approaches, the C
code, as currently written, seems to produce the best possible
performance on 32-bit x86 without sacrificing any performance on 64-bit
x86.  However, that doesn't mean that it couldn't be improved upon--
perhaps even dramatically-- by using hand-written assembly.  Other
codecs, such as the Intel Performance Primitives, manage to produce
similar Huffman performance on both 64-bit and 32-bit.  libjpeg-turbo
can mostly match their 64-bit performance but not their 32-bit
performance, which leads me to believe that they're doing something
fundamentally different with their Huffman codec.  Perhaps they are even
using SIMD instructions, although I have spent much time investigating
that as well, and I couldn't manage to find a method that didn't require
moving data back and forth between the SIMD registers and the regular
registers (because you can't branch when using SIMD instructions, and
branching is somewhat critical to the Huffman algorithm.)

If someone could manage to fix, or even improve, the way registers are
used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev

Reply via email to