https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63175
--- Comment #24 from Martin Sebor <msebor at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #16)
> Why is the loop bound to i != 16 / sizeof *s?

The upper bound is intended to make the copied sequence fit into one vector register, irrespective of the size of the array element.

The vector load and store instructions tolerate unaligned accesses, and there are permute instructions that combine the contents of two vector registers into a single one to compensate for unaligned reads or writes. I'm not sure it makes sense to expect unaligned copies involving a single vector register's worth of data to be vectorized (as done in my proposed tests for char and short), but I would expect larger unaligned copies (i.e., multiples of 16 bytes) to benefit from it. In my experiments I've seen no evidence of GCC attempting to vectorize such copies, but I need to do some more research to understand why.

(In reply to comment #23)
The test uses -maltivec and that's what I've been using as well. But I see in the Power ISA book that lxvw4x and stxvw4x are classified as VSX instructions, so perhaps they shouldn't be emitted without -mvsx. Although 5.0 doesn't emit them even with -mvsx.
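
For illustration, here is a minimal sketch of the kind of loop under discussion. It is not the actual test from the patch: the DEFINE_COPY macro and the copy_* function names are made up for this example. Each function copies exactly 16 bytes (one vector register's worth), whatever the element type.

```c
#include <stddef.h>

/* Hypothetical helper macro (not from the proposed tests): defines a
   function that copies 16 / sizeof (T) elements, i.e. exactly 16 bytes,
   regardless of the element type T.  */
#define DEFINE_COPY(T, name)                            \
  void name (T *d, const T *s)                          \
  {                                                     \
    for (size_t i = 0; i != 16 / sizeof *s; ++i)        \
      d[i] = s[i];                                      \
  }

DEFINE_COPY (char, copy_char)    /* 16 iterations */
DEFINE_COPY (short, copy_short)  /* 8 iterations when short is 2 bytes */
DEFINE_COPY (int, copy_int)      /* 4 iterations when int is 4 bytes */
```

Calling these with pointers that are not 16-byte aligned (e.g. buf + 1) gives the unaligned case. Compiling for a powerpc target with something like -O3 -maltivec or -O3 -mvsx and inspecting the assembly should show whether vector loads and stores are used; which instructions appear will depend on the compiler version and target.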