>if one removes the crippling >-fno-tree-vectorize Yes, I think a config option to turn this flag on (like the unsafe bitstream reader) would be good. Defaulting to off by default if it doesn't break anything for at least a few people (and compilers) who test it. It's not a big performance impact but every little bit counts nowadays.
On 3 November 2015 at 03:10, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: > On Mon, Nov 2, 2015 at 9:32 PM, Ganesh Ajjanagadde <gajja...@mit.edu> > wrote: > > On Mon, Nov 2, 2015 at 6:49 PM, Ganesh Ajjanagadde <gajja...@mit.edu> > wrote: > >> On Mon, Nov 2, 2015 at 6:37 PM, wm4 <nfx...@googlemail.com> wrote: > >>> On Mon, 2 Nov 2015 14:49:54 -0500 > >>> Ganesh Ajjanagadde <gajjanaga...@gmail.com> wrote: > >>> > >>>> This improves accuracy for the bessel function, and this in turn > should > >>>> improve the quality of the Kaiser window. > >>> > >>> > >>> "Should"? Does it or does it not? If you don't know, why the patch? > >> > >> It improves the window in the sense of mathematically matching the > >> Kaiser window expression due to the improved bessel function accuracy. > >> That claim I have verified and placed in the commit message with > >> evidence. > >> > >> What that translates into in terms of resampling accuracy, I don't > >> know. Normally, such things do help reduce the noise floor, but I > >> don't know an easy way to test via FATE or associated tools. I put > >> that caveat in the bottom lines of the patch. > > > > Turns out the init speed is definitely improved as well (~20% boost > > with default settings). > > I did a cheap trick of calling build filter 1000 times to get a large > > number of runs. > > Results (x86-64, Haswell, GNU/Linux): > > > > test: fate-swr-resample-dblp-44100-2626 > > > > new: > > 995894468 decicycles in build_filter(loop 1000), 256 runs, 0 > skips > > 1029719302 decicycles in build_filter(loop 1000), 512 runs, 0 > skips > > 984101131 decicycles in build_filter(loop 1000), 1024 runs, 0 > skips > > > > old: > > 1250020763 decicycles in build_filter(loop 1000), 256 runs, 0 > skips > > 1246353282 decicycles in build_filter(loop 1000), 512 runs, 0 > skips > > 1220017565 decicycles in build_filter(loop 1000), 1024 runs, 0 > skips > > > > Thus, this patch has both things going for it luckily. Will leave to > > the maintainer (Michael I believe) to give details of accuracy > > benefits as translated to the actual resampling if easily testable, > > and I will add the perf numbers to the message. > > One last comment on performance (this is something I have discussed > with Timothy privately), if one removes the crippling > -fno-tree-vectorize, FATE still passes on my GCC 5.2 setup, and one > gets further ~5 % perf improvement here: > 949318408 decicycles in build_filter(loop 1000), 256 runs, 0 skips > 948795082 decicycles in build_filter(loop 1000), 512 runs, 0 skips > 928925076 decicycles in build_filter(loop 1000), 1024 runs, 0 skips > > I am sure a lot of other places may benefit. Yes, I know it was buggy > on older GCC versions, etc, but I see no reason to cripple users with > modern toolchains. > The commit 973859f5230e77beea7bb59dc081870689d6d191 is ancient and has > a "sloppy" commit message. > > "It provides no speed benefit" is not true, and even back then was > never backed up by hard numbers, though the commit would have been ok > on non x86 at the time, possibly even now. As can be seen clearly from > https://ffmpeg.org/pipermail/ffmpeg-devel/2009-July/069586.html, much > of Mans's evidence came from non x86 architectures which are Mans's > expertise. In the absence of any complaints, the change was committed > and x86 people have suffered from it. Yes, particular versions may be > buggy, but such things should be checked more carefully especially > since this is "free performance". > > Mans's argument was that adventurous people could add it back in. > Needless to say, it was not documented anywhere, and casual users like > me (and I suspect some others here) were completely unaware of this. > It is quite obscure and easy to miss unless one digs through configure > (or config.mak), or finds out from someone else. > > I also recall (and this prompted the private conversation with > Timothy) that Timothy's memcpy change from the loop actually yield no > difference with a half decent auto-vectoriser. It is a low hanging > fruit in my view, and I have posted a perf number above. > > At the very least, we should inform users of this low hanging thing > e.g by printing out during configure an "info" message - they can try > out an "adventurous" build with this, test out FATE, and if it passes, > go ahead with a new build. > Or we could have a "dev option" for --enable-vectorize. > Or at a more ambitious scale, we could collect configs where this will > work, test for them, and only disable otherwise. > There are a number of possibilities, for which a separate thread is better. > > > > >> > >>> _______________________________________________ > >>> ffmpeg-devel mailing list > >>> ffmpeg-devel@ffmpeg.org > >>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel