On 12/3/17, Martin Vignali <martin.vign...@gmail.com> wrote: > 2017-12-03 20:36 GMT+01:00 Paul B Mahol <one...@gmail.com>: > >> On 12/3/17, Martin Vignali <martin.vign...@gmail.com> wrote: >> >> >> >> In any case, if clang or gcc can generate better code, then the hand >> >> written version needs to be optimized to be as fast or faster. >> >> >> >> >> >> >> > Quick test : pass checkasm (but probably only because width = 256) >> > hflip_byte_c: 26.4 >> > hflip_byte_ssse3: 20.4 >> > >> > >> > INIT_XMM ssse3 >> > cglobal hflip_byte, 3, 5, 2, src, dst, w, x, v, src2 >> > mova m0, [pb_flip_byte] >> > xor xq, xq ; <====== >> > mov wd, dword wm >> > sub wq, mmsize * 2 >> > ;remove the cmp here <====== >> > jl .skip >> > >> > .loop0: ; process two xmm in the loop >> > neg xq >> > movu m1, [srcq + xq - mmsize + 1] >> > movu m2, [srcq + xq - mmsize * 2 + 1] <====== >> > pshufb m1, m0 >> > pshufb m2, m0 <====== >> > neg xq >> > movu [dstq + xq], m1 >> > movu [dstq + xq + mmsize], m2 <====== >> > add xq, mmsize * 2 <====== >> > cmp xq, wq >> > jl .loop0 >> > RET ; add RET here >> > >> > ; MISSING one xmm process if need >> > >> > .skip: >> > add wq, mmsize >> > .loop1: >> > neg xq >> > mov vb, [srcq + xq] >> > neg xq >> > mov [dstq + xq], vb >> > add xq, 1 >> > cmp xq, wq >> > jl .loop1 >> > RET >> >> So what is wrong now? >> > > Doesn't see your email, when i send mine. > > Check asm result with your last patch (and modify for the short version > "add xq, mmsize" to "add xq, mmsize * 2") > hflip_byte_c: 28.0 > hflip_byte_ssse3: 127.5 > hflip_short_c: 276.5 > hflip_short_ssse3: 100.2 >
Ops, fixed. > > Do you think if you add RET after the end of loop0 , it can work in all > cases ? No, it would try to read before src, and crash. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel