On 9 July 2017 at 01:49, Ivan Kalvachev <ikalvac...@gmail.com> wrote:
> This should be the final work-in-progress patch.
>
> What's changed:
>
> 1. Removed the conditional macro defines. The defaults seem to be
> optimal on all machines that I got benchmarks from. HADDPS and PHADDD
> are always slower, "BLEND"s are never slower than the emulation.
>
> 2. Remove SHORT_SYY_UPDATE. It is always slower.
>
> 3. Remove ALL_FLOAT_PRESEARCH, it is always slower. Remove the ugly
> hack to use 256bit ymm with avx1 and integer operations.
>
> 4. Remove remaining disabled code.
>
> 5. Use HADDD macro from "x86util", I don't need the result in all
> lanes/elements
>
> 6. Use v-prefix for all avx code.
>
> 7. Small optimization: Move some of the HSUMPS in the K!=0 branch.
>
> 8. Small optimization: Instead of pre-calculating 2*Y[i] and then
> correcting it on exit, it is possible to use Syy/2 instead in
> distortion parameter calculations. It saves a few multiplications in
> pre-search and sign restore loop. It however gives a different
> approximation of sqrt(). It's not (consistently) better or worse than
> the previous approximation.
>
> 9. Using movdqa to load "const_int32_offsets". Wrong type might
> explain why directly using mem constants is sometimes faster.
>
> 10. Move some code around and do minor tweaks.
> ---
>
> I do not intend to remove "PRESEARCH_ROUNDING" and
> "USE_APPROXIMATION" (while for the latter I think I will remove
> method#1, I've left it this time just for consistency).
> These defines control the precision and the type of results that the
> function generates.
> E.g. This function can produce the same results as opus functions with
> "PRESEARCH_ROUNDING 0".
> If you want the same results as the ffmpeg improved function, then you
> need "approx#0". It uses real division and is much slower on older
> cpu's, but reasonably fast on anything recent.
>
> I've left 2 other defines. "CONST_IN_X64_REG_IS_FASTER" and
> "STALL_WRITE_FORWARDING".
> On Sandy Bridge and later, "const_in_x64" has always been faster.
> On my cpu it is about the same.
> On Ryzen the "const_in_x64" was consistently faster in all sse/avx
> variants, by about 5%. But not if "stall_write" is enabled too.
> Ryzen (allegedly) has no write stalling, but that method alone is just
> a few cycles faster (about 0.5%).
>
> I'd like to see if the changes I've done this time would affect the
> above results.
>
>
> The code is much cleaner and you are free to nitpick on it.
>
> There is something that I'm not exactly sure if I need.
> The function gets 2 integer parameters, and I am not sure
> if I have to sign extend them in 64-bit mode, in order to clear
> the high 32 bits. These parameters should never be negative, so the
> sign is not needed.
> All 32bit operands should also clear the high bits.
> Still I'm not sure if there is a guarantee that
> the high bits won't contain garbage.
>
>
> Best Regards
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

No detectable regression from v3. Whitespace error though:

.git/rebase-apply/patch:154: trailing whitespace.
; Horizontal Sum Packed Single precision float
warning: 1 line adds whitespace errors.
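
For what it's worth, a whitespace error of this kind can be caught before sending. Below is a rough sketch, not the check git itself performs: it scans the lines a unified diff adds and reports any that end in spaces or tabs. The function name `added_trailing_whitespace` and the file path in the sample are made up for illustration, and the header test is a simplification (an added line whose content begins with "++" would be skipped).

```python
import re


def added_trailing_whitespace(patch_text):
    """Return (line number within the patch, added text) for each added
    line that ends in spaces or tabs."""
    hits = []
    for number, line in enumerate(patch_text.split("\n"), start=1):
        # Only lines the patch adds are of interest; skip "+++ b/file" headers.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        content = line[1:]
        if re.search(r"[ \t]+$", content):
            hits.append((number, content))
    return hits


# Hypothetical patch fragment; the path is invented for this example.
sample = "\n".join([
    "--- a/libavcodec/x86/example.asm",
    "+++ b/libavcodec/x86/example.asm",
    "@@ -1,2 +1,3 @@",
    " mov eax, 1",
    "+; Horizontal Sum Packed Single precision float ",
    "+    add eax, 2",
])

for number, content in added_trailing_whitespace(sample):
    print(f"patch:{number}: trailing whitespace: {content!r}")
```

The number reported is the line within the patch text, which is also how the warning above counts. Running `git diff --check` before committing reports the same class of problem without any extra script.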