https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80283

Marc Glisse <glisse at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra

--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
IMO this should be rtl-optimization or middle-end. The .optimized dump looks
fine to me. Expansion pulls many of the vec_duplicate to the beginning of the
loop (they were interleaved with the uses before), which increases live ranges
a lot, and nothing moves them back closer to their use. I don't know if doing
the reads early, as gcc chooses to do, can ever compensate for having to spill
on this testcase, since the memory access pattern seems quite cache-friendly.