Two ideas here. The first 3 patches alter the old mmx code so that it can use xmm registers. It still only uses half the available width and adds a few shuffles, so it isn't an ideal solution, though its output is exact compared with the mmx version. It seems to be moderately faster on Skylake despite the shuffles but similar in speed on Yorkfield (like some of my previous work). Possibly useful if anybody still uses a 32-bit build on these CPUs.
The 4th patch is a bit of cleanup I did while reading and partly redoing the 10-bit simple_idct. It uses named registers to remove a little indirection, though not everywhere yet. It could be applied independently of the other patches in this set.

The last 2 are an attempt to use the 10- and 12-bit macros for an 8-bit function. I don't think the result is correct, perhaps due to rounding or a small difference in the coefficients used. Changing the coefficients causes other errors.

James Darnley (6):
  initial alignment corrections for xmm registers
  change explicit mmx register use to x264asm style
  add and fix xmm version of simple_idct
  avcodec/x86: cleanup simple_idct10
  add x86_64 8-bit simple_idct function
  change coeffs

 libavcodec/tests/x86/dct.c                |    5 +
 libavcodec/x86/idctdsp_init.c             |   11 +
 libavcodec/x86/proresdsp.asm              |    2 +-
 libavcodec/x86/simple_idct.asm            | 1242 +++++++++++++++--------------
 libavcodec/x86/simple_idct.h              |    4 +
 libavcodec/x86/simple_idct10.asm          |   18 +-
 libavcodec/x86/simple_idct10_template.asm |   64 +-
 7 files changed, 715 insertions(+), 631 deletions(-)

-- 
2.12.2