On 25/01/15 10:11 AM, Christophe Gisquet wrote:
> Hi,
>
> 2015-01-25 2:05 GMT+01:00 James Almer <jamr...@gmail.com>:
>> 2 to 2.5 times faster.
>>
>> Signed-off-by: James Almer <jamr...@gmail.com>
>> ---
>>  libavcodec/x86/sbrdsp.asm | 114 +++++++++++++++++++++++++++++++++++++++++++
>
> Not the first time that I notice that, but memmoves are often
> suboptimal using old SSE ones.
> While movlhps is fine, movlps isn't, on my old core i5. You may want
> to validate this with the attached patch, where storing ps_mask3 in m8
> is a gain in Win64 (the gain does not match the number of loops, but
> it is still there).

I can reproduce the gains using mov{q,sd} instead of movlps, but not with the
mask loaded into m8 (tested on win64 using a k10 cpu and linux x64 using a
Haswell cpu).

>
> Benchmarks:
> x64:  6023 decicycles in g, 262108 runs, 36 skips
> SSE:  3049 decicycles in g, 262130 runs, 14 skips
> SSE3: 2843 decicycles in g, 262086 runs, 58 skips
> movq: 2693 decicycles in g, 262117 runs, 27 skips
> m8:   2648 decicycles in g, 262083 runs, 61 skips
>
> Thanks for doing it, I had only 3yo scraps left and no further
> motivation to tackle the start/tail parts.

I applied the first part for now. Thanks.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel