On 12/3/17, Martin Vignali <martin.vign...@gmail.com> wrote:
>>
>> In any case, if clang or gcc can generate better code, then the hand
>> written version needs to be optimized to be as fast or faster.
>>
>>
>>
> Quick test : pass checkasm (but probably only because width = 256)
> hflip_byte_c: 26.4
> hflip_byte_ssse3: 20.4
>
>
> INIT_XMM ssse3
> cglobal hflip_byte, 3, 5, 2, src, dst, w, x, v, src2
>     mova    m0, [pb_flip_byte]
>     xor     xq, xq ; <======
>     mov     wd, dword wm
>     sub     wq, mmsize * 2
>     ;remove the cmp here <======
>     jl .skip
>
> .loop0: ; process two xmm in the loop
>     neg     xq
>     movu    m1, [srcq + xq - mmsize + 1]
>     movu    m2, [srcq + xq - mmsize * 2 + 1] <======
>     pshufb  m1, m0
>     pshufb  m2, m0 <======
>     neg     xq
>     movu    [dstq + xq], m1
>     movu    [dstq + xq + mmsize], m2 <======
>     add     xq, mmsize * 2 <======
>     cmp     xq, wq
>     jl .loop0
>     RET ; add RET here
>
> ; MISSING one xmm process if need
>
> .skip:
>     add     wq, mmsize
> .loop1:
>     neg     xq
>     mov     vb, [srcq + xq]
>     neg     xq
>     mov     [dstq + xq], vb
>     add     xq, 1
>     cmp     xq, wq
>     jl .loop1
>     RET
So what is wrong now?
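For checking any version of the loop against arbitrary widths, here is a minimal C sketch of what the quoted structure (a main loop over two xmm registers, then a byte-wise remainder) has to compute. It assumes src points at the last byte of the row, as the negative indexing in the quoted asm suggests. The names hflip_byte_ref, hflip_byte_blocked and BLOCK are made up for illustration and are not FFmpeg code; BLOCK = 32 stands in for mmsize * 2 with xmm registers.

```c
#include <stdint.h>

/* Scalar reference: dst[j] = src[-j].
 * Assumption: src points at the LAST byte of the input row. */
static void hflip_byte_ref(const uint8_t *src, uint8_t *dst, int w)
{
    for (int j = 0; j < w; j++)
        dst[j] = src[-j];
}

/* Hypothetical stand-in for mmsize * 2 (two 16-byte xmm registers). */
#define BLOCK 32

/* Blocked version: full blocks in the main loop, then a scalar tail.
 * The inner loop stands in for the two movu/pshufb/movu groups. */
static void hflip_byte_blocked(const uint8_t *src, uint8_t *dst, int w)
{
    int x = 0;

    /* Main loop: runs only while a full BLOCK of bytes remains. */
    for (; x + BLOCK <= w; x += BLOCK)
        for (int k = 0; k < BLOCK; k++)
            dst[x + k] = src[-(x + k)];

    /* Tail: the remaining w - x bytes (between 0 and BLOCK - 1). */
    for (; x < w; x++)
        dst[x] = src[-x];
}
```

In this sketch a width that is a multiple of 32 (such as 256) never enters the tail loop, so comparing the two functions at widths like 1, 31, 33 or 77 is what exercises the remainder path.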