https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118076
--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
In the RISCV case it is optimized because the copying of the structure into the argument area is done using 4 DImode loads + stores rather than 2 TImode loads + stores. And in that case it is actually cse1 which sees through the memory stores and so for
s.x = x
s.y = y
s.z = z
s.w = w
t1 = s.x
t2 = s.y
t3 = s.z
t4 = s.w
arg[0] = t1
arg[1] = t2
arg[2] = t3
arg[3] = t4
replaces the 4 middle insns with t1 = x; t2 = y; t3 = z; t4 = w. And then dse1 optimizes the first 4 insns away.
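
For readers who want the sequence above in compilable form, here is a rough C-level analogy of the three stages. It is only a sketch: the real passes operate on RTL insns, not on C source, and the struct layout, the function names (copy_original, copy_after_cse1, copy_after_dse1) and the use of int64_t to stand in for a DImode value are all assumptions made for illustration, not taken from the PR's testcase.

```c
#include <stdint.h>

/* Hypothetical struct: four 64-bit fields, one DImode-sized value each. */
struct S { int64_t x, y, z, w; };

/* Stage 0: store into s, load back through s, then store to the
   argument area.  Mirrors the 12 pseudo-insns listed above. */
void copy_original(int64_t x, int64_t y, int64_t z, int64_t w, int64_t arg[4])
{
    struct S s;
    s.x = x; s.y = y; s.z = z; s.w = w;
    int64_t t1 = s.x, t2 = s.y, t3 = s.z, t4 = s.w;
    arg[0] = t1; arg[1] = t2; arg[2] = t3; arg[3] = t4;
}

/* Stage 1: what the cse1 step amounts to.  The four loads from s are
   replaced by the values that were just stored, so the stores to s no
   longer feed anything. */
void copy_after_cse1(int64_t x, int64_t y, int64_t z, int64_t w, int64_t arg[4])
{
    struct S s;
    s.x = x; s.y = y; s.z = z; s.w = w;   /* now dead stores */
    (void)s;                              /* silence unused-variable warnings */
    int64_t t1 = x, t2 = y, t3 = z, t4 = w;
    arg[0] = t1; arg[1] = t2; arg[2] = t3; arg[3] = t4;
}

/* Stage 2: what the dse1 step amounts to.  The dead stores to s are
   removed, leaving only the copies into the argument area. */
void copy_after_dse1(int64_t x, int64_t y, int64_t z, int64_t w, int64_t arg[4])
{
    int64_t t1 = x, t2 = y, t3 = z, t4 = w;
    arg[0] = t1; arg[1] = t2; arg[2] = t3; arg[3] = t4;
}
```

All three functions write the same values to arg. The forwarding in stage 1 is straightforward here because each load reads exactly the bytes written by one earlier store, which corresponds to the 4 DImode loads + stores case described above.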