Hello,

Thanks Ivan for your comments and explanations.
---
> > [...]
> > +;******************************************************************************
> > +
> > +%include "libavutil/x86/x86util.asm"
>
> Still missing explicit x86inc.asm

If I include x86inc instead of x86util, I get a linker error (it seems the function prefix becomes x264 instead of ff).

> > +
> > +    shr    sizeq, 1        ; sizeq = half_size
> > +    mov    r3, sizeq
> > +    shr    r3, 4           ; r3 = half_size/16 -> loop_simd count
> > +
> > +loop_simd:
> > +    ; initial loop condition
> > +    jle    after_loop_simd ; jump to scalar part if loop_simd count (r3) is 0
> > +
> > +    movdqa m0, [srcq]          ; load first part
> > +    movdqu m1, [srcq + sizeq]  ; load second part
>
> Would you test if moving the movdqu first makes any difference in speed?
> I had a similar case and I think that makes it faster,
> since movdqu has bigger latency.
> Might not matter on newer cpu.
>
> (If you can't tell the difference, leave it as it is.)

I don't notice a speed difference.

For the rest of your comments:

You're right, I can remove the scalar part: the src and dst buffers seem to be padded to 32 bytes by av_fast_padded_malloc, so for the SSE version that is enough to avoid overreads and overwrites. It will need more care for an AVX2 version, though.

I also modified the loop following your comments: src and src2 are offset by half_size, and dst by 2*half_size, so I can remove some add/sub instructions and use a negative index (half_size * -1) as the offset for src, src2 and dst.

The current asm version is below (still WIP, but it passes fate for me). I still need to check more carefully the maximum overread/overwrite for various size values.

%include "libavutil/x86/x86util.asm"

SECTION .text

;------------------------------------------------------------------------------
; void ff_reorder_pixels(uint8_t *src, uint8_t *dst, int size)
;------------------------------------------------------------------------------

INIT_XMM sse2
cglobal reorder_pixels, 3,5,3, src, dst, size
    add    dstq, sizeq   ; offset dstq by 2 * half_size
    shr    sizeq, 1      ; sizeq = half_size
    mov    r3, sizeq     ; r3 = half_size
    add    srcq, r3      ; offset src by half_size
    mov    r4, srcq      ; r4 is the start of the second part of the buffer
    add    r4, r3        ; offset r4 by half_size
    neg    r3            ; r3 = half_size * -1 (offset for dst, src, src2 (r4))

loop_simd:
    ; loop condition
    jge    end

    movdqa m0, [srcq + r3]               ; load first part
    movdqu m1, [r4   + r3]               ; load second part
    punpcklbw m2, m0, m1                 ; interleaved part 1
    movdqa [dstq + r3 * 2], m2           ; copy to dst array
    punpckhbw m0, m1                     ; interleaved part 2
    movdqa [dstq + r3 * 2 + mmsize], m0  ; copy to dst array

    add    r3, mmsize
    jmp    loop_simd

end:
    RET

For the perf, the current state is:

Scalar:
3082024 decicycles in reorder_pixels_zip, 130413 runs, 659 skips
bench: utime=115.926s
bench: maxrss=607670272kB

SSE asm:
296370 decicycles in reorder_pixels_zip, 130946 runs, 126 skips
bench: utime=101.481s
bench: maxrss=607698944kB

SSE intrinsics:
289448 decicycles in reorder_pixels_zip, 130944 runs, 128 skips
bench: utime=101.417s
bench: maxrss=607694848kB

After taking a look at the asm generated by clang from the intrinsics version (at -O2), it seems clang rewrites the loop_simd part to process twice as many bytes per iteration (and adds a condition to handle an odd half_size). I will run some tests to see whether I can get a speed improvement using the same method.

Martin
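
For reference, here is a plain C equivalent of the reordering the asm above implements (bytes from the first and second half of src interleaved into dst). This is only a minimal sketch to document the expected behaviour; the function name and types are illustrative, not taken from the actual FFmpeg code:

#include <stdint.h>
#include <stddef.h>

/* Reference behaviour of the asm loop:
 *   dst[2*i]     = src[i]            (first half)
 *   dst[2*i + 1] = src[size/2 + i]   (second half)
 * Illustrative sketch only; not the actual FFmpeg function. */
static void reorder_pixels_ref(const uint8_t *src, uint8_t *dst, ptrdiff_t size)
{
    const uint8_t *s1 = src;            /* first half  */
    const uint8_t *s2 = src + size / 2; /* second half */
    ptrdiff_t i;

    for (i = 0; i < size / 2; i++) {
        dst[2 * i]     = s1[i];
        dst[2 * i + 1] = s2[i];
    }
}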
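
And a rough SSE2 intrinsics sketch of the kind of 2x-unrolled loop described above, just to make the idea concrete. The assumptions here are mine, not from the patch: src and dst are padded and at least 16-byte aligned (as with av_fast_padded_malloc), so aligned loads/stores are used for the first half and for dst, and a scalar tail stands in for the odd-half_size handling:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* 2x-unrolled variant: 32 bytes from each half are interleaved per iteration.
 * Assumes src and dst are 16-byte aligned and padded; illustrative only. */
static void reorder_pixels_unroll2(const uint8_t *src, uint8_t *dst, ptrdiff_t size)
{
    ptrdiff_t half = size >> 1;
    const uint8_t *s1 = src;        /* first half, aligned           */
    const uint8_t *s2 = src + half; /* second half, may be unaligned */
    ptrdiff_t i = 0;

    for (; i + 32 <= half; i += 32) {
        __m128i a0 = _mm_load_si128 ((const __m128i *)(s1 + i));
        __m128i b0 = _mm_loadu_si128((const __m128i *)(s2 + i));
        __m128i a1 = _mm_load_si128 ((const __m128i *)(s1 + i + 16));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(s2 + i + 16));

        _mm_store_si128((__m128i *)(dst + 2 * i),      _mm_unpacklo_epi8(a0, b0));
        _mm_store_si128((__m128i *)(dst + 2 * i + 16), _mm_unpackhi_epi8(a0, b0));
        _mm_store_si128((__m128i *)(dst + 2 * i + 32), _mm_unpacklo_epi8(a1, b1));
        _mm_store_si128((__m128i *)(dst + 2 * i + 48), _mm_unpackhi_epi8(a1, b1));
    }

    /* scalar tail for a half_size that is not a multiple of 32 */
    for (; i < half; i++) {
        dst[2 * i]     = s1[i];
        dst[2 * i + 1] = s2[i];
    }
}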