> On 12 Jan 2021, at 21:46, Lynne <d...@lynne.ee> wrote:
>
> Jan 12, 2021, 19:28 by reimar.doeffin...@gmx.de:
>
>> It’s almost 20%. At least for this combination of
>> codec and stream a large amount of time is spent in
>> non-DSP functions, so even hand-written assembler
>> won’t give you huge gains.
>>
> It's non-guaranteed 20% on a single system. It could change, and it could very
> well mess up like gcc does with autovectorization, which we still explicitly
> disable because FATE fails (-fno-tree-vectorize, and I was the one who sent an
> RFC to try to undo it somewhat recently. Even though it was an RFC the reaction
> from devs was quite cold).
Oh, thanks for the reminder. I thought that was gone, because it seems it’s not
used for clang, and MPlayer does not seem to set it. I need to compare it.
However, the problem with auto-vectorization is exactly that the compiler will
try to apply it to everything, which has at least two issues:
1) It gigantically increases the risk of bugs when it is applied to every single
   loop, instead of only the loops we already wrote assembly for somewhere.
2) It will quite often make things worse, by vectorizing loops that are rarely
   iterated over more than a few times (and it needs to generate a whole lot of
   extra code to handle loop counts that are not a multiple of the vector size),
   because all too often the compiler can only take a wild guess whether “width”
   is usually 1 or 1920, while we DO know.

>>> it's definitely something the compiler should
>>> be able to decide on its own,
>>>
>>
>> So you object to unlikely() macros as well?
>> It’s really just giving the compiler a hint it should try, though I admit
>> the configure part makes it look otherwise.
>>
> I'm more against the macro and changes to the code itself. If you can make it
> work without adding a macro to individual loops or the likes of av_cold/av_hot
> or any other changes to the code, I'll be more welcoming.

I expect that will just run into the same issue as the tree-vectorize...

> I really _hate_ compiler hints. Take a look at the upipe source code to see
> what a cthulian monstrosity made of hint flags looks like. Every single branch
> had a cold/hot macro and it was the project's coding style. It's completely
> irredeemable.

I guess my suggested solution would be to require proof of a clearly measurable
performance benefit. But I see the point that if it gets “randomly” added to
loops, it might turn out quite a mess.

>>> Most of the loops this is added to are trivially SIMDable.
>>>
>>
>> How many hours of effort do you consider “trivial”?
>> Especially if it’s someone not experienced?
>> It might be fairly trivial with intrinsics, however
>> many of your counter-arguments also apply
>> to intrinsics (and to a degree inline assembly).
>> That’s btw not just a rhetorical question, because
>> I’m pretty sure I am not going to all the trouble
>> to port more of the arm 32-bit assembler functions
>> since it’s a huge PITA, and I was wondering if there
>> was a point to even have a try with intrinsics...
>>
> Intrinsics and inline assembly are a whole different thing than magic
> macros that tell and force the compiler what a well written compiler
> should already very well know about.

There are no well-written compilers, in a way ;)
I would also argue that most of what intrinsics do, such a compiler should be
able to figure out on its own, too. And the first time I tried intrinsics they
slowed the loop down by a factor of 2, because the compiler stored and loaded
the value to the stack between every intrinsic, so it’s not like they come
without problems either.
But I was actually thinking that it might be somewhat interesting to have a kind
of “generic SIMD intrinsics” (a rough sketch of what I mean is below). Though I
think I read that such a thing has already been tried, so it might just be
wasted time.

> I already said all that can be said here: this will halt efforts on actually
> optimizing the code in exchange for naive trust in compilers.

I’m not sure it will discourage it more than having to write the optimizations
over and over: for Armv7 NEON, for Armv8 Linux, for Armv8 Windows, then
SVE/SVE2, and who knows, maybe Armv9 will also need a rewrite.
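To make the “generic SIMD intrinsics” idea a bit more concrete, here is a rough
sketch of what a “write it once” loop could look like using the GCC/Clang
vector_size extension, where the compiler picks the actual instructions for the
target (purely an illustration, not code from any patch; the type and function
names are made up):

/* Hypothetical sketch only: same idea as SIMD intrinsics, but portable.
 * The compiler maps the vector type to whatever the target has
 * (SSE2, NEON, ...). Names are made up for illustration. */
#include <stdint.h>
#include <string.h>

typedef uint8_t v16u8 __attribute__((vector_size(16)));

static void add_row_u8(uint8_t *dst, const uint8_t *src, int len)
{
    int i = 0;
    for (; i + 16 <= len; i += 16) {
        v16u8 a, b;
        memcpy(&a, dst + i, 16);   /* unaligned load */
        memcpy(&b, src + i, 16);
        a += b;                    /* element-wise add (wraps, no saturation) */
        memcpy(dst + i, &a, 16);   /* store */
    }
    for (; i < len; i++)           /* scalar tail for len % 16 != 0 */
        dst[i] += src[i];
}

Whether the compiler actually turns that into decent code on every target is of
course exactly the open question, but at least it would only have to be written
once instead of once per instruction set.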
On x86 it is the same with SSE2, AVX2 and AVX-512: so much stuff never gets
ported to the new versions. I’d also claim anyone naively trusting in compilers
is unlikely to write SIMD optimizations either way :)

> New platforms will be stuck at scalar performance anyway until
> the compilers for the arch are smart enough to deal with vectorization.

That seems to happen a long time before someone gets around to optimising
FFmpeg, though. This is particularly true when the new platform is a new OS
rather than a new CPU architecture. For example, on macOS we are lucky enough
that the assembler etc. are largely compatible with Linux. But for
Windows-on-Arm there is no GNU assembler, and the Microsoft assembler needs a
completely different syntax, so even the assembly we DO have just doesn’t work.

Anyway, thanks for the discussion. I still think the situation with SIMD
optimizations should be improved SOMEHOW, but I have nothing but wild ideas on
the HOW. If anyone feels the same, I’d welcome further discussion.

Thanks,
Reimar