https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809
Andrew Waterman <andrew at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com

--- Comment #2 from Andrew Waterman <andrew at sifive dot com> ---
To respond to some of Palmer's points:

In general, doing a single reduction at the end will perform better than doing
multiple reductions.  For the same total number of additions, sum reductions
tend to be slower (or at least no faster) than regular vector adds.
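
To illustrate the two loop shapes being compared, here is a minimal sketch in
portable scalar C.  It is only a model, not RVV code: LANES, reduce_sum,
sum_reduce_each_iter and sum_reduce_once are hypothetical names chosen for this
example, and the fixed-width array stands in for a vector register.  Both
functions perform the same number of additions; the second keeps the additions
element-wise and leaves a single reduction for the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* hypothetical vector length, in elements */

/* Horizontal (reduction) sum of one LANES-wide "vector". */
static uint32_t reduce_sum(const uint32_t v[LANES])
{
    uint32_t s = 0;
    for (size_t l = 0; l < LANES; l++)
        s += v[l];
    return s;
}

/* Shape 1: one reduction per loop iteration. */
uint32_t sum_reduce_each_iter(const uint32_t *x, size_t n)
{
    assert(x != NULL || n == 0);
    uint32_t total = 0;
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        total += reduce_sum(&x[i]);      /* reduction inside the loop */
    for (; i < n; i++)                   /* scalar tail */
        total += x[i];
    return total;
}

/* Shape 2: element-wise adds in the loop, a single reduction at the end. */
uint32_t sum_reduce_once(const uint32_t *x, size_t n)
{
    assert(x != NULL || n == 0);
    uint32_t acc[LANES] = {0};           /* models a vector accumulator */
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += x[i + l];          /* models a regular vector add */
    uint32_t total = reduce_sum(acc);    /* the only reduction */
    for (; i < n; i++)                   /* scalar tail */
        total += x[i];
    return total;
}
```

Both functions return the same value for the same input (unsigned arithmetic
wraps identically in either order); only the placement of the reduction
differs.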

On some microarchitectures, vcpop.m results in a loss-of-decoupling event,
since it's consumed by the scalar unit.  To get reasonable performance on those
uarches, you need to use maximal LMUL to amortize the loss-of-decoupling event
over a greater amount of vector work.  (The alternative is to unroll the loop
such that each vcpop.m writes a different x-register, but that's far messier
than using large LMUL.)
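
As a rough picture of the unrolled alternative, the sketch below is a portable
scalar analogy only, not RVV code.  Each 64-bit word stands in for one mask
register, __builtin_popcountll (a GCC/Clang builtin) stands in for vcpop.m, and
the four independent counters stand in for distinct x-registers.  The function
name and the unroll factor of 4 are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits across n mask words, unrolled by 4 so that each
 * popcount result lands in its own scalar accumulator. */
uint64_t count_set_bits_unrolled(const uint64_t *mask, size_t n)
{
    assert(mask != NULL || n == 0);
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;  /* independent destinations */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += (uint64_t)__builtin_popcountll(mask[i + 0]);
        c1 += (uint64_t)__builtin_popcountll(mask[i + 1]);
        c2 += (uint64_t)__builtin_popcountll(mask[i + 2]);
        c3 += (uint64_t)__builtin_popcountll(mask[i + 3]);
    }
    for (; i < n; i++)                        /* leftover words */
        c0 += (uint64_t)__builtin_popcountll(mask[i]);
    return c0 + c1 + c2 + c3;                 /* combine once at the end */
}
```

The extra accumulators, the tail loop and the final combine are the kind of
bookkeeping that makes this approach messier than simply selecting a larger
LMUL.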
