https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118019
--- Comment #10 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Ah I see - the actual vector code isn't even that bad and the vec_constructs aren't either. The problem is rather that we have slow unaligned (scalar) access with the default tune model. Thus we need to load 8 individual uint8s to actually load one long - of course the vec_init costs then underestimate what's really happening.

If we enable fast unaligned scalar access we choose a different vectorization scheme, so the issue above is not relevant anymore...

Another issue is that we use the wrong vectype for costing the vec_construct. Will prepare a patch for that.
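
To illustrate the "8 individual uint8s for one long" point, here is a minimal C sketch. It is not the testcase from this PR, and the function names are made up for illustration. The first helper spells out the byte-wise assembly that slow unaligned scalar access roughly amounts to. The second expresses the same load via memcpy, which a compiler may be able to turn into a single 64-bit load when unaligned access is considered fast.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a 64-bit value from 8 single-byte loads.
   This is roughly the work needed when an unaligned 64-bit scalar
   load cannot be done in one access.  Bytes are combined in
   little-endian order.  */
static uint64_t
load_u64_bytewise (const unsigned char *p)
{
  uint64_t v = 0;
  for (int i = 0; i < 8; i++)
    v |= (uint64_t) p[i] << (8 * i);
  return v;
}

/* Hypothetical helper: the same load written with memcpy.  The result
   uses the host byte order, so it only matches the function above on
   little-endian targets.  */
static uint64_t
load_u64_memcpy (const unsigned char *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof v);
  return v;
}
```

If each element of a vec_construct is fed by something like the first variant, its real cost is several scalar loads plus shifts and ORs per element, which is what the vec_init costing would need to reflect.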