https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87507
--- Comment #8 from Peter Bergner <bergner at gcc dot gnu.org> ---
So Vlad is hesitant (probably rightly :) about accepting my patch. Looking closer, on BE, lower subreg2 is able to break the TImode accesses into 2 DImode accesses, which helps tremendously. On LE (power8), split1 runs just before lower subreg2 and inserts swaps on the memory accesses, which confuses lower subreg, so we keep the TImode accesses and get register pairs, which are hard to allocate and lead to poor decisions in this particular case. As a hack, I moved lower subreg2 before split1 and we get the code we want. I don't think we want to do that for real, so I will look at enhancing lower subreg to recognize our TImode memory ops with swaps, to see whether we can still decompose them.
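
To make the decomposition idea concrete, here is a rough conceptual model in C. It is not GCC internals and not the patch under discussion: the type `ti_pair` and every function name below are hypothetical, and the "swap" is only a stand-in for the doubleword swaps that split1 inserts on LE power8. It sketches why a TImode copy can be treated as two independent DImode copies, and why a swapped load followed by a swap still amounts to the same two DImode accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a TImode (128-bit) value kept in a register
   pair: both halves travel together. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} ti_pair;

/* The undecomposed form: one 128-bit copy, so both halves are tied
   together (analogous to needing a register pair). */
void copy_ti(ti_pair *dst, const ti_pair *src)
{
    *dst = *src;
}

/* The decomposed form: two independent 64-bit (DImode-like) copies, each
   of which could live in any single register. */
void copy_two_di(uint64_t *dst_lo, uint64_t *dst_hi, const ti_pair *src)
{
    *dst_lo = src->lo;
    *dst_hi = src->hi;
}

/* Model of the LE form after splitting: the load delivers the two
   doublewords in swapped order... */
ti_pair load_swapped(const ti_pair *src)
{
    ti_pair t = { src->hi, src->lo };
    return t;
}

/* ...and an explicit swap puts them back. */
ti_pair swap_halves(ti_pair v)
{
    ti_pair t = { v.hi, v.lo };
    return t;
}

/* Checks that all three forms produce the same two 64-bit values.
   Returns 1 on success; the asserts abort otherwise. */
int self_check(void)
{
    ti_pair src = { 0x1111u, 0x2222u };
    ti_pair whole;
    uint64_t lo, hi;

    copy_ti(&whole, &src);
    copy_two_di(&lo, &hi, &src);
    assert(lo == whole.lo && hi == whole.hi);

    /* Swapped load followed by a swap is just the plain load again, so it
       is still expressible as the same two 64-bit accesses. */
    ti_pair le = swap_halves(load_swapped(&src));
    assert(le.lo == lo && le.hi == hi);
    return 1;
}
```

In this model the load-plus-swap pair carries no information beyond the two 64-bit halves, which is the property a swap-aware lower subreg would presumably need to recognize.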