On Fri, Sep 22, 2017 at 11:12 PM, Martin Vignali
<martin.vign...@gmail.com> wrote:
> +static void predictor_scalar(uint8_t *src, ptrdiff_t size)
> +{
> +    uint8_t *t    = src + 1;
> +    uint8_t *stop = src + size;
> +
> +    while (t < stop) {
> +        int d = (int) t[-1] + (int) t[0] - 128;
> +        t[0] = d;
> +        ++t;
> +    }
> +}
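Side note, since it matters for the constants further down: on bytes,
subtracting 128 mod 256 is the same as flipping the high bit, i.e.
x - 128, x + 128 and x ^ 0x80 all coincide. A quick standalone check,
for illustration only (not part of either patch):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (int x = 0; x < 256; x++) {
        /* uint8_t arithmetic wraps mod 256, so adding and subtracting
         * 128 are the same operation, and both just flip the top bit */
        assert((uint8_t)(x - 128) == (uint8_t)(x + 128));
        assert((uint8_t)(x - 128) == (uint8_t)(x ^ 0x80));
    }
    return 0;
}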
The function itself can be simplified quite a bit:

static void predictor_scalar(uint8_t *src, ptrdiff_t size)
{
    for (ptrdiff_t i = 1; i < size; i++)
        src[i] += src[i-1] - 128;
}

> +SECTION_RODATA 32
> +
> +neg_128: times 16 db -128
> +shuffle_15: times 16 db 15

Drop the 32-byte alignment from the section directive, we don't need it
here. db -128 is also odd, since as a byte it's identical to +128. I
would rename the constants to follow the usual convention:

pb_128: times 16 db 128
pb_15:  times 16 db 15

> +INIT_XMM ssse3
> +cglobal predictor, 2,3,5, src, size, tmp
> +
> +    mov  tmpb, [srcq]
> +    xor  tmpb, -128
> +    mov  [srcq], tmpb
> +
> +;offset src by size
> +    add  srcq, sizeq
> +    neg  sizeq              ; size = offset for src
> +
> +;init mm
> +    mova m0, [neg_128]      ; m0 = const for xor high byte
> +    mova m1, [shuffle_15]   ; m1 = shuffle mask
> +    pxor m2, m2             ; m2 = prev_buffer
> +
> +.loop:
> +    mova m3, [srcq + sizeq]
> +    pxor m3, m0
> +
> +    ;compute prefix sum
> +    mova   m4, m3
> +    pslldq m4, 1
> +
> +    paddb  m4, m3
> +    mova   m3, m4
> +    pslldq m3, 2
> +
> +    paddb  m3, m4
> +    mova   m4, m3
> +    pslldq m4, 4
> +
> +    paddb  m4, m3
> +    mova   m3, m4
> +    pslldq m3, 8
> +
> +    paddb  m4, m2
> +    paddb  m4, m3
> +
> +    mova [srcq + sizeq], m4
> +
> +    ;broadcast high byte for next iter
> +    pshufb m4, m1
> +    mova   m2, m4
> +
> +    add  sizeq, mmsize
> +    jl .loop
> +    RET

The whole thing can be reworked like this. Initializing the running sum
to 128 makes the scalar first-byte fixup unnecessary (128 + (x ^ 0x80)
== x mod 256), and one macro then covers SSSE3, AVX and AVX2:

%macro PREDICTOR 0
cglobal predictor, 2,3,5, src, size, tmp
%if mmsize == 32
    vbroadcasti128   m0, [pb_128]
%else
    mova            xm0, [pb_128]
%endif
    mova            xm1, [pb_15]
    mova            xm2, xm0                ; running sum, starts at 128
    add            srcq, sizeq
    neg            sizeq
.loop:
    pxor             m3, m0, [srcq + sizeq] ; x - 128 == x ^ 0x80
    pslldq           m4, m3, 1              ; log2-step prefix sum
    paddb            m3, m4
    pslldq           m4, m3, 2
    paddb            m3, m4
    pslldq           m4, m3, 4
    paddb            m3, m4
    pslldq           m4, m3, 8
%if mmsize == 32
    paddb            m3, m4                 ; per-lane prefix sums
    paddb           xm2, xm3                ; low lane + carried-in total
    vextracti128    xm4, m3, 1
    mova  [srcq + sizeq], xm2
    pshufb          xm2, xm1                ; broadcast low-lane total
    paddb           xm2, xm4                ; high lane + low-lane total
    mova  [srcq + sizeq + 16], xm2
%else
    paddb            m2, m3
    paddb            m2, m4
    mova  [srcq + sizeq], m2
%endif
    pshufb          xm2, xm1                ; broadcast byte 15 for next iteration
    add           sizeq, mmsize
    jl .loop
    RET
%endmacro

INIT_XMM ssse3
PREDICTOR

INIT_XMM avx
PREDICTOR

%if HAVE_AVX2_EXTERNAL
INIT_YMM avx2
PREDICTOR
%endif

predictor_c:     15351.5
predictor_ssse3:  1206.5
predictor_avx:    1207.5
predictor_avx2:    880.0

On SKL-X, only tested in checkasm. AVX is the same speed as SSSE3
because modern Intel CPUs eliminate reg-reg moves in the register
renaming stage, but somewhat older CPUs such as Sandy Bridge, which is
still quite popular, do not, so the AVX version should help there.

On Fri, Sep 22, 2017 at 11:12 PM, Martin Vignali
<martin.vign...@gmail.com> wrote:
> Hello,
>
> Attached is a patch porting the predictor part of this commit to asm:
>
> https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf
>
> Tested on OSX; it passes the fate tests and checkasm for me.
>
> Results with the reorder SIMD disabled:
>
> SSSE3: 94.5s
> 1036758 decicycles in predictor, 130751 runs, 321 skips
>
> Scalar: 114s
> 4255109 decicycles in predictor, 130276 runs, 796 skips
>
> Using both the reorder and predictor SIMD: 82.6s
>
> checkasm benchmark:
> ./tests/checkasm/checkasm --test=exrdsp --bench
>
> predictor_c:     10635.1
> predictor_ssse3:  1634.6
>
> Comments welcome
>
> Martin
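PS: for anyone trying to follow the pslldq/paddb ladder above, this is
roughly what one 16-byte iteration computes, written out in scalar C.
A sketch for illustration only; the function name and layout are made
up and it is not part of either patch:

#include <stdint.h>
#include <string.h>

/* Model of one 16-byte block: a log2-step (Hillis-Steele) prefix sum
 * over the bytes, plus the running total carried in from the previous
 * block. "prev" plays the role of m2 in the asm and starts at 128,
 * which is why the first block needs no special casing. */
static void predictor_block_model(uint8_t *blk, uint8_t *prev)
{
    uint8_t tmp[16];

    for (int i = 0; i < 16; i++)
        blk[i] ^= 0x80;                 /* pxor with pb_128: x - 128 mod 256 */

    for (int shift = 1; shift < 16; shift <<= 1) {
        memcpy(tmp, blk, 16);           /* pslldq m4, m3, shift ... */
        for (int i = shift; i < 16; i++)
            blk[i] += tmp[i - shift];   /* ... paddb m3, m4 */
    }

    for (int i = 0; i < 16; i++)
        blk[i] += *prev;                /* add carried-in running total */

    *prev = blk[15];                    /* pshufb broadcast for the next block */
}

The AVX2 variant does the same thing per 128-bit lane (pslldq does not
shift across lanes) and then fixes up the high lane with one extra
broadcast and add.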