> I have spent much time investigating
> that as well, and I couldn't manage to find a method that didn't require
> moving data back and forth between the SIMD registers and the regular
> registers (because you can't branch when using SIMD instructions, and
> branching is somewhat critical to the Huffman algorithm.)

You've probably looked at this, but on x86, the pmovmskb instruction
(the _mm_movemask_epi8() intrinsic) is pretty good for branching on the
result of a SIMD compare. It packs the top bit of each byte lane into a
general-purpose register, so an ordinary test-and-branch works on it.

-Justin

On Thu, Oct 27, 2011 at 3:59 PM, DRC <dcomman...@users.sourceforge.net> wrote:
> On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
>> Also huffman decoder optimizations (which are C code, not SIMD) in
>> libjpeg-turbo seem to be providing only some barely measurable
>> improvement on ARM, while huffman speedup is clearly more impressive
>> on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a
>> result.
>
> In general, the Huffman codec improvements produce a greater speedup on
> 64-bit vs. 32-bit and a greater speedup when compressing vs.
> decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about
> a 25-50% improvement vs. the libjpeg Huffman codec when doing
> compression using 64-bit code, it only realizes a few percent speedup
> vs. libjpeg when doing decompression using 32-bit code. The Huffman
> algorithm uses a single register as a bit bucket, and the fewer times it
> has to shift in new bits to that register, the faster it is. That's why
> it's so much faster on 64-bit vs. 32-bit.
>
> The Huffman codec is probably the single biggest piece of low-hanging
> fruit in the entire code base, since it represents something like 40-50%
> of total execution time in many cases. I've spent hundreds of hours
> looking at it, and the basic problem with the 32-bit code seems to be
> register exhaustion. After trying many different approaches, the C
> code, as currently written, seems to produce the best possible
> performance on 32-bit x86 without sacrificing any performance on 64-bit
> x86. However, that doesn't mean that it couldn't be improved upon--
> perhaps even dramatically-- by using hand-written assembly. Other
> codecs, such as the Intel Performance Primitives, manage to produce
> similar Huffman performance on both 64-bit and 32-bit.
> libjpeg-turbo can mostly match their 64-bit performance but not their
> 32-bit performance, which leads me to believe that they're doing
> something fundamentally different with their Huffman codec. Perhaps
> they are even using SIMD instructions, although I have spent much time
> investigating that as well, and I couldn't manage to find a method that
> didn't require moving data back and forth between the SIMD registers
> and the regular registers (because you can't branch when using SIMD
> instructions, and branching is somewhat critical to the Huffman
> algorithm.)
>
> If someone could manage to fix, or even improve, the way registers are
> used in the 32-bit Huffman codec, it would greatly benefit both ARM and
> x86.
>
> _______________________________________________
> Libjpeg-turbo-devel mailing list
> libjpeg-turbo-de...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/libjpeg-turbo-devel

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
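
For reference, the pmovmskb suggestion at the top of this message could be sketched roughly as below. This is only an illustration of branching on a SIMD compare, not code from libjpeg-turbo or IPP. The helper name find_byte16() and the 16-byte block size are assumptions made for the example, and the scalar fallback is there only so the sketch builds on non-SSE2 targets. Note that the mask still travels from a SIMD register to a general-purpose one, but it does so in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define SKETCH_HAVE_SSE2 1
#endif

/*
 * Hypothetical helper (not part of any library): return the index of the
 * first byte in a 16-byte block that equals `needle`, or -1 if none does.
 */
static int find_byte16(const uint8_t *block, uint8_t needle)
{
    assert(block != NULL);

#ifdef SKETCH_HAVE_SSE2
    /* Compare all 16 bytes at once; matching lanes become 0xFF. */
    __m128i data = _mm_loadu_si128((const __m128i *)block);
    __m128i eq   = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)needle));

    /* pmovmskb: collect the top bit of each byte lane into bits 0..15
     * of an ordinary integer register. */
    int mask = _mm_movemask_epi8(eq);

    /* From here on it is a plain scalar branch on the SIMD result. */
    if (mask == 0)
        return -1;

    /* Index of the lowest set bit = index of the first matching byte. */
    int idx = 0;
    while ((mask & 1) == 0) {
        mask >>= 1;
        idx++;
    }
    return idx;
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 16; i++) {
        if (block[i] == needle)
            return i;
    }
    return -1;
#endif
}
```

A caller would hand it a pointer to 16 readable bytes, e.g. find_byte16(buf, 0xFF) to check a chunk of a JPEG entropy-coded stream for 0xFF bytes, and take a slow path only when the result is not -1.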