On 2017-04-05 05:44, James Almer wrote:
> On 4/4/2017 10:53 PM, James Darnley wrote:
>> Haswell:
>>  - 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext
>>
>> Skylake-U:
>>  - 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
>
> Again, you should add an SSE2 version first, then an AVX one if it's
> measurably faster than the SSE2 one.

On a Yorkfield sse2 is barely faster:
 - 1.02x faster (728±2.1 vs. 710±3.9 decicycles)
So 1 or 2 cycles.

On a Skylake-U sse2 is most of the speedup:
 - 1.15x faster (661±2.2 vs. 573±1.9 decicycles)
Then avx gains a mere 3 cycles: 547±0.5

On a Haswell sse2 provides only half the speedup:
 - sse2: 1.06x faster (525±2.5 vs. 497±1.0 decicycles)
 - avx:  1.06x faster (497±1.0 vs. 468±1.2 decicycles)

(All on 64-bit Linux)

On Nehalem and 64-bit Windows sse2 is slower:
 - 0.92x faster (597±3.0 vs. 650±9.3 decicycles)

And on that note I should probably recheck the deblock patches I pushed
a little while ago.

So... SSE2 for this function, yay or nay?

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
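
For anyone wanting to re-derive the ratios and cycle counts above, here is a minimal Python sketch. It assumes the "Nx faster" figure is simply the reference time divided by the new time, and that a decicycle is a tenth of a cycle. The helper names `speedup` and `cycles_saved` are made up for illustration and are not part of any FFmpeg tool. Because the inputs are the rounded means printed in the message, a recomputed ratio can differ from the quoted one in the last digit.

```python
def speedup(ref_decicycles, new_decicycles):
    """Ratio of reference time to new time; below 1.0 means the new code is slower."""
    return ref_decicycles / new_decicycles


def cycles_saved(ref_decicycles, new_decicycles):
    """Difference in whole cycles, assuming 1 decicycle = 0.1 cycle."""
    return (ref_decicycles - new_decicycles) / 10.0


# (label, reference, new) in decicycles, copied from the figures in the message.
RESULTS = [
    ("Yorkfield, sse2", 728, 710),
    ("Skylake-U, sse2", 661, 573),
    ("Skylake-U, avx over sse2", 573, 547),
    ("Haswell, sse2", 525, 497),
    ("Haswell, avx over sse2", 497, 468),
    ("Nehalem, sse2", 597, 650),
]

if __name__ == "__main__":
    for label, ref, new in RESULTS:
        ratio = speedup(ref, new)
        delta = cycles_saved(ref, new)
        print(f"{label}: {ratio:.2f}x, {delta:+.1f} cycles")
```

A negative cycle difference, as on Nehalem, corresponds to a ratio below 1.0, i.e. a slowdown.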