https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89226
--- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> ---
The optimized dump for copy1 looks like

  *to_2(D) = *from_3(D);

so we get essentially memcpy, while copy2 has

  _4 = MEM[(const struct foo512 &)from_3(D)].a;
  MEM[(struct foo512 *)to_2(D)].a = _4;
  _5 = MEM[(const struct foo512 &)from_3(D)].b;
  MEM[(struct foo512 *)to_2(D)].b = _5;

which we expand literally. I agree that we should generate the same code for both (ideally we would reach expand with essentially the same GIMPLE representation, although I am not sure how).

A question is whether the memcpy expansion is optimal for that target. It could be that as long as you are only copying a rather small object, it isn't worth switching to larger registers which cause a drop in the processor frequency. However, the code generated is not impacted if I use other AVX instructions nearby.

-Os can make us generate 'rep movsl' for copy1.
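
For readers without the original testcase at hand, the two dumps above correspond roughly to source of the following shape. This is only a sketch reconstructed from the dumps: the struct name foo512, the members a and b, and the parameter kinds (pointer destination, const reference source) are taken from the dump, but the member type v4di (a 256-bit vector written with the GCC/Clang vector_size extension) and the helper check_copies_agree are assumptions made for illustration, not the reporter's code.

```cpp
#include <cassert>
#include <cstring>

// ASSUMPTION: the real member type is not shown in the comment. A 256-bit
// vector (GCC/Clang vector_size extension) is used as a plausible stand-in,
// since the copy2 dump loads each member into a single SSA temporary.
typedef long long v4di __attribute__((vector_size(32)));

struct foo512 {
    v4di a;
    v4di b;
};

// Whole-object assignment: the dump shows a single aggregate copy,
// "*to_2(D) = *from_3(D);", which is expanded like memcpy.
void copy1(foo512 *to, const foo512 &from) {
    *to = from;
}

// Member-wise assignment: the dump shows one load and one store per member,
// which is expanded literally.
void copy2(foo512 *to, const foo512 &from) {
    to->a = from.a;
    to->b = from.b;
}

// Hypothetical helper: both functions must produce the same bytes, which is
// why identical code generation would be the expected outcome.
void check_copies_agree() {
    v4di a = {1, 2, 3, 4};
    v4di b = {5, 6, 7, 8};
    foo512 src;
    src.a = a;
    src.b = b;

    foo512 d1, d2;
    std::memset(&d1, 0, sizeof d1);
    std::memset(&d2, 0, sizeof d2);
    copy1(&d1, src);
    copy2(&d2, src);

    assert(std::memcmp(&d1, &d2, sizeof(foo512)) == 0);
    assert(std::memcmp(&d1, &src, sizeof(foo512)) == 0);
}
```

Compiling such a file with -O2 -fdump-tree-optimized should let the two GIMPLE forms be compared side by side, although the exact dump text depends on the GCC version and on the real member type.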