> > 64 runs seems way too few runs for reliable result. > Please try to run ffmpeg encoding for about a minute. > > About the patch: > > > +%include "libavutil/x86/x86util.asm" > > + > > +SECTION .text > > Please include "x86inc.asm" explicitly. > > > +INIT_XMM sse2 > > +cglobal reorder_pixels, 3,5,3, src, dst, size > > + > > +shr r2, 1;half_size > > It's better to use the labels you define in the cglobal, > If I count correctly, "r2" would be "sizeq". > ("srcq" and "dstq" would be the correct labels for the pointers). > > > > +;calc loop count for simd reorder > > +mov r3, r2; > > +shr r3, 4;calc loop count simd > > Align the instruction at 4 spaces. > Align the first operands so that the ending commas are vertically aligned. > For the follow up operands, just add one space after the comma. > > Put some spaces before the comment, to separate it from the operands. > BTW, this is not C, you don't need to end every line with ";" > e.g. > > + mov r3, r2 > > + shr r3, 4 ;calc loop count simd > > > > +;jump to afterloop2 if loop count simd is 0 > > +cmp r3, 0; > > +jle afterloop2; > > If you only check for size==0, then > the "shr" has already set the correct Z flag. > > However, if r3/size==1, you'd still read > 16 bytes at once in the first loop. > Aka, overread. > > > > +;simd loop > > +loop1: > > +movdqa xmm0, [r0];load first 16 bytes > > Use "m0" instead. > > > > +lea r4, [r0 + r2]; > > + > > +movdqu xmm1, [r4]; unaligned load > > r4 doesn't seem to be used for anything else. > Just move the address calculation directly into > the "movdqu", it can take it. > > > +movdqa xmm2, xmm0;copy xmm0 > > + > > +punpcklbw xmm2, xmm1; > > +movdqa [r1], xmm2; > > +add r1, 16; > > use "mmsize" instead of 16. > > > +movdqa xmm2, xmm0; > > +punpckhbw xmm2, xmm1; > > +movdqa [r1], xmm2; > > +add r1, 16; > > You can actually avoid one of the "add" > if you use [r1+mmsize] and add 32 at the second > "add" (or 2*mmsize). > > > +dec r3; > > +add r0, 16; > > > +; test repeat > > +cmp r3, 0; > > +jge loop1; > > First, "dec" instructions are avoided, > because they do a partial update > on the flag register and > this causes interdependence. > > Second, you can use the flags from > the "sub" to check if it has reached > zero or has gone negative. > > > > +afterloop2: > > +;scalar part > > + > > +mov r3, r2; > > +and r3, 15 ;half_size % 16 > > +lea r4, [r0 + r2]; > > + > > +;initial condition loop > > +cmp r3, 0; > > +jle end; > > + > > +loop2: > > +mov r1, r0; > > +inc r1; > > +mov r1, r4; > > +inc r1; > > +inc r0; > > +inc r4; > > +dec r3; > > +; test repeat > > +cmp r3, 0; > > +jg loop2; > > O_o > This loop does not read or write to memory. > > You can replace the second "cmp" in this > loop with unconditional jump to the > "initial condition loop" compare. > (aka move loop2: two instruction above). > > This is how "for(;;)" is usually implemented in C compilers. > > Be sure to test your function with different sizes, > to salt your output buffers and check for underwrites, overwrites > etc... > > Best Regards. > > > Thanks for your comments,
i will take a look, on all of this. Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel