On 5/25/2017 12:50 PM, Clément Bœsch wrote:
> ---
>
> This is still not benchmarked (written and verified with qemu).
>
> I typically wrote an alternative implementation for
> stereo_interpolate[0] which needs to be compared with the current one:
>
> function ff_ps_stereo_interpolate_neon, export=1
>         ld1 {v0.4S}, [x2]
>         ld1 {v1.4S}, [x3]
> 1:
>         ld1 {v2.2S}, [x0]
>         ld1 {v3.2S}, [x1]
>         fadd v0.4S, v0.4S, v1.4S
>         fmul v4.2S, v2.2S, v0.S[0]
>         fmul v5.2S, v2.2S, v0.S[1]
>         fmla v4.2S, v3.2S, v0.S[2]
>         fmla v5.2S, v3.2S, v0.S[3]
>         st1 {v4.2S}, [x0], #8
>         st1 {v5.2S}, [x1], #8
>         subs w4, w4, #1
>         b.gt 1b
>         ret
> endfunc
>
> I don't know which is faster. For now, the current version follows the
> logic I used in stereo_interpolate[1] (the ipdopd one). It's doing less
> mult operations, but more shuffling.
>
> A 3rd alternative would be possible if it was possible to assume len % 2
> was always true (allowing overreading and overwriting by one more entry
> basically). Currently, this is not the case.
>
> Speaking of ipdopd, the factors table and the ext may be clumsy.
> ---
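
A scalar C reading of the quoted alternative may help when comparing it against the current version. This is only a sketch inferred from the assembly above: the function name, the parameter names, and the mapping x0 = l, x1 = r, x2 = h, x3 = h_step, w4 = len are assumptions, and it assumes len >= 1 (the assembly loop always runs at least once).

```c
/* Scalar model of the quoted loop (hypothetical name and signature).
 * l and r each hold len complex (re, im) float pairs, updated in place. */
static void stereo_interpolate_ref(float (*l)[2], float (*r)[2],
                                   const float h[4], const float h_step[4],
                                   int len)
{
    /* Local copy: the assembly keeps h in v0 and never stores it back. */
    float hc[4] = { h[0], h[1], h[2], h[3] };

    for (int n = 0; n < len; n++) {
        const float l_re = l[n][0], l_im = l[n][1];   /* ld1 {v2.2S}, [x0] */
        const float r_re = r[n][0], r_im = r[n][1];   /* ld1 {v3.2S}, [x1] */

        for (int i = 0; i < 4; i++)                   /* fadd v0, v0, v1 */
            hc[i] += h_step[i];

        l[n][0] = l_re * hc[0] + r_re * hc[2];        /* fmul + fmla -> v4 */
        l[n][1] = l_im * hc[0] + r_im * hc[2];
        r[n][0] = l_re * hc[1] + r_re * hc[3];        /* fmul + fmla -> v5 */
        r[n][1] = l_im * hc[1] + r_im * hc[3];
    }
}
```

Running both NEON variants and a model like this over the same input is one way to check that they agree before benchmarking.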
[...]
> +function ff_ps_stereo_interpolate_ipdopd_neon, export=1
> +        movrel x5, ipdopd_factors
> +        ld1 {v20.4S}, [x5]
> +        ld1 {v0.4S,v1.4S}, [x2]
> +        ld1 {v6.4S,v7.4S}, [x3]
> +1:
> +        ld1 {v2.2S}, [x0]
> +        ld1 {v3.2S}, [x1]
> +        dup v2.2D, v2.D[0]
> +        dup v3.2D, v3.D[0]
> +        fadd v0.4S, v0.4S, v6.4S
> +        fadd v1.4S, v1.4S, v7.4S
> +        zip1 v16.4S, v0.4S, v0.4S
> +        zip2 v17.4S, v0.4S, v0.4S
> +        zip1 v18.4S, v1.4S, v1.4S
> +        zip2 v19.4S, v1.4S, v1.4S
> +        fmul v4.4S, v2.4S, v16.4S
> +        fmla v4.4S, v3.4S, v17.4S
> +        ext v2.16B, v2.16B, v2.16B, #4
> +        ext v3.16B, v3.16B, v3.16B, #4
> +        fmul v5.4S, v2.4S, v18.4S
> +        fmla v5.4S, v3.4S, v19.4S
> +        fmla v4.4S, v5.4S, v20.4S

You could make ipdopd_factors be 0, INT32_MIN, 0, INT32_MIN, then replace
the fmla with eor + fadd. No idea if that will actually be faster, though.
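
To illustrate why the eor + fadd substitution should give the same result, here is a minimal scalar sketch in C. It assumes the factors table holds only +1.0 and -1.0; the table's contents are not shown in the quoted patch, so the lane layout below is hypothetical and simply follows the 0, INT32_MIN, 0, INT32_MIN order given above. 0x80000000 is the bit pattern of INT32_MIN, i.e. the IEEE 754 single-precision sign bit. All function and array names are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of EOR on one float lane: XOR the raw bits with mask. */
static float xor_bits(float x, uint32_t mask)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* type-pun without aliasing issues */
    bits ^= mask;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Assumed layouts: factor -1.0 pairs with the sign-bit mask, +1.0 with 0. */
static const float    factors[4] = { 1.0f, -1.0f, 1.0f, -1.0f };
static const uint32_t masks[4]   = { 0, 0x80000000u, 0, 0x80000000u };

/* acc[i] += v[i] * factors[i]: models the final fmla in the quoted code. */
static void combine_fmla(float acc[4], const float v[4])
{
    for (int i = 0; i < 4; i++)
        acc[i] += v[i] * factors[i];
}

/* acc[i] += (v[i] ^ masks[i]): models eor followed by fadd. */
static void combine_eor_fadd(float acc[4], const float v[4])
{
    for (int i = 0; i < 4; i++)
        acc[i] += xor_bits(v[i], masks[i]);
}
```

Multiplying by +1.0 or -1.0 is exact, so both paths round only once, in the addition. Whether trading the multiply-accumulate for two cheaper instructions is a win would still have to be measured on real hardware.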