On 8/5/17, Martin Vignali <martin.vign...@gmail.com> wrote: > Hello, > > Based on pull request in the official openexr library : > https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf > > i try to port the reorder part (used in zip and rle decompression), to asm > Tested on OS X > pass fate test for me > > I test with this command : > ./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i > ./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0 > > the result of the timer are : > scalar : 960355 decicycles in reorder_pixels_zip, 64 runs, 0 > skips > sse : 84763 decicycles in reorder_pixels_zip, 64 runs, 0 > skips > > Comments Welcome > > Martin > Jokyo Images
64 runs seems way too few runs for reliable result. Please try to run ffmpeg encoding for about a minute. About the patch: > +%include "libavutil/x86/x86util.asm" > + > +SECTION .text Please include "x86inc.asm" explicitly. > +INIT_XMM sse2 > +cglobal reorder_pixels, 3,5,3, src, dst, size > + > +shr r2, 1;half_size It's better to use the labels you define in the cglobal, If I count correctly, "r2" would be "sizeq". ("srcq" and "dstq" would be the correct labels for the pointers). > +;calc loop count for simd reorder > +mov r3, r2; > +shr r3, 4;calc loop count simd Align the instruction at 4 spaces. Align the first operands so that the ending commas are vertically aligned. For the follow up operands, just add one space after the comma. Put some spaces before the comment, to separate it from the operands. BTW, this is not C, you don't need to end every line with ";" e.g. > + mov r3, r2 > + shr r3, 4 ;calc loop count simd > +;jump to afterloop2 if loop count simd is 0 > +cmp r3, 0; > +jle afterloop2; If you only check for size==0, then the "shr" has already set the correct Z flag. However, if r3/size==1, you'd still read 16 bytes at once in the first loop. Aka, overread. > +;simd loop > +loop1: > +movdqa xmm0, [r0];load first 16 bytes Use "m0" instead. > +lea r4, [r0 + r2]; > + > +movdqu xmm1, [r4]; unaligned load r4 doesn't seem to be used for anything else. Just move the address calculation directly into the "movdqu", it can take it. > +movdqa xmm2, xmm0;copy xmm0 > + > +punpcklbw xmm2, xmm1; > +movdqa [r1], xmm2; > +add r1, 16; use "mmsize" instead of 16. > +movdqa xmm2, xmm0; > +punpckhbw xmm2, xmm1; > +movdqa [r1], xmm2; > +add r1, 16; You can actually avoid one of the "add" if you use [r1+mmsize] and add 32 at the second "add" (or 2*mmsize). > +dec r3; > +add r0, 16; > +; test repeat > +cmp r3, 0; > +jge loop1; First, "dec" instructions are avoided, because they do a partial update on the flag register and this causes interdependence. Second, you can use the flags from the "sub" to check if it has reached zero or has gone negative. > +afterloop2: > +;scalar part > + > +mov r3, r2; > +and r3, 15 ;half_size % 16 > +lea r4, [r0 + r2]; > + > +;initial condition loop > +cmp r3, 0; > +jle end; > + > +loop2: > +mov r1, r0; > +inc r1; > +mov r1, r4; > +inc r1; > +inc r0; > +inc r4; > +dec r3; > +; test repeat > +cmp r3, 0; > +jg loop2; O_o This loop does not read or write to memory. You can replace the second "cmp" in this loop with unconditional jump to the "initial condition loop" compare. (aka move loop2: two instruction above). This is how "for(;;)" is usually implemented in C compilers. Be sure to test your function with different sizes, to salt your output buffers and check for underwrites, overwrites etc... Best Regards. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel