https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108
--- Comment #10 from Jeff Hammond <jeff.science at gmail dot com> ---
Thanks for the feedback. I agree that it is a huge amount of work to optimize this. For what it's worth, GCC and Clang perform about the same.

Unfortunately, I do not have the means to evaluate IBM XLF, which may have an optimized implementation of this (https://rd.springer.com/chapter/10.1007%2F978-3-642-32820-6_23), so I do not have a good sense of what is achievable here, other than what I hand-optimize.

I have no objection if you want to close this as invalid.