> I have spent much time investigating
> that as well, and I couldn't manage to find a method that didn't require
> moving data back and forth between the SIMD registers and the regular
> registers (because you can't branch when using SIMD instructions, and
> branching is somewhat critical to the Huffman algorithm.)

You've probably looked at this, but on x86, the pmovmskb instruction
(_mm_movemask_epi8() intrinsic) is pretty good for branching on the
result of a SIMD compare.

-Justin

On Thu, Oct 27, 2011 at 3:59 PM, DRC <dcomman...@users.sourceforge.net> wrote:
> On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
>> Also huffman decoder optimizations (which are C code, not SIMD) in
>> libjpeg-turbo seem to be providing only some barely measurable
>> improvement on ARM, while huffman speedup is clearly more impressive
>> on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a
>> result.
>
> In general, the Huffman codec improvements produce a greater speedup on
> 64-bit vs. 32-bit and a greater speedup when compressing vs.
> decompressing.  So, whereas libjpeg-turbo's Huffman codec realizes about
> a 25-50% improvement vs. the libjpeg Huffman codec when doing
> compression using 64-bit code, it only realizes a few percent speedup
> vs. libjpeg when doing decompression using 32-bit code.  The Huffman
> algorithm uses a single register as a bit bucket, and the fewer times it
> has to shift in new bits to that register, the faster it is.  That's why
> it's so much faster on 64-bit vs. 32-bit.
>
> The Huffman codec is probably the single biggest piece of low-hanging
> fruit in the entire code base, since it represents something like 40-50%
> of total execution time in many cases.  I've spent hundreds of hours
> looking at it, and the basic problem with the 32-bit code seems to be
> register exhaustion.  After trying many different approaches, the C
> code, as currently written, seems to produce the best possible
> performance on 32-bit x86 without sacrificing any performance on 64-bit
> x86.  However, that doesn't mean that it couldn't be improved upon--
> perhaps even dramatically-- by using hand-written assembly.  Other
> codecs, such as the Intel Performance Primitives, manage to produce
> similar Huffman performance on both 64-bit and 32-bit.  libjpeg-turbo
> can mostly match their 64-bit performance but not their 32-bit
> performance, which leads me to believe that they're doing something
> fundamentally different with their Huffman codec.  Perhaps they are even
> using SIMD instructions, although I have spent much time investigating
> that as well, and I couldn't manage to find a method that didn't require
> moving data back and forth between the SIMD registers and the regular
> registers (because you can't branch when using SIMD instructions, and
> branching is somewhat critical to the Huffman algorithm.)
>
> If someone could manage to fix, or even improve, the way registers are
> used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.
>
> ------------------------------------------------------------------------------
> The demand for IT networking professionals continues to grow, and the
> demand for specialized networking skills is growing even more rapidly.
> Take a complimentary Learning@Cisco Self-Assessment and learn
> about Cisco certifications, training, and career opportunities.
> http://p.sf.net/sfu/cisco-dev2dev
> _______________________________________________
> Libjpeg-turbo-devel mailing list
> libjpeg-turbo-de...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/libjpeg-turbo-devel
>

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev

Reply via email to