https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115675

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2025-01-24
             Status|UNCONFIRMED                 |NEW
             Blocks|                            |53947
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  Note the BB vectorizer simply cobbles up as many operations as
it can.  As we now have truncv4hiv4qi, we happily vectorize the hi->qi
truncation, but we fail to also cover the loads, given that half of the lanes
have a shift.  We do have some heuristics that avoid operations on all
"externs", but only if it appears to be uniform.

          /* If we have all children of a child built up from uniform scalars
             or does more than one possibly expensive vector construction then
             just throw that away, causing it built up from scalars.
             The exception is the SLP node for the vector store.  */
...
              if (all_uniform_p
                  || n_vector_builds > 1
                  || (n_vector_builds == children.length ()
                      && is_a <gphi *> (stmt_info->stmt)))
                {
                  /* Roll back.  */

In this case !all_uniform_p and n_vector_builds == 1.  There's an exception
for a "copy", but I think scrapping all conversions would be bad (for example,
int<->float converts are OK to vectorize).

I think this is a case where SLP discovery should maybe have used

  tem = load >> { 8, 0, 8, 0 };
  tem2 = (vector(4) char) tem;
  store = tem2;

thus the missed insert of neutral operand operations issue (a shift by 0
for the lanes that have no shift).  This is of course still short of
detecting the bswap.

I'll note that without vectorizing, the bswap pass doesn't detect the bswap
either, so it seems we do rely on the vector CTOR detection of the bswap pass.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
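
For readers who want to see what the "tem = load >> { 8, 0, 8, 0 }" form from
comment #4 computes, here is a scalar, lane-by-lane model in C.  It is only a
sketch and not the testcase from this PR: the function name
shift_then_truncate and its parameters are hypothetical, and the lane values
in the usage note are made up for illustration.  A shift count of 0 plays the
role of the neutral operand for lanes that have no shift in the original
scalar code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of
     tem = load >> shifts;  tem2 = (vector(n) char) tem;  store = tem2;
   Each 16-bit lane is shifted by its own count and then truncated to 8 bits.
   A count of 0 is the neutral operand: the lane is only truncated.  */
void
shift_then_truncate (const uint16_t *load, const unsigned *shifts,
                     uint8_t *store, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* Shift counts must stay below the lane width of 16 bits.  */
      assert (shifts[i] < 16);
      uint16_t tem = (uint16_t) (load[i] >> shifts[i]);
      /* The hi->qi truncation keeps the low byte of the shifted lane.  */
      store[i] = (uint8_t) tem;
    }
}
```

Called with lanes { 0x1234, 0x1234, 0xABCD, 0xABCD } and shift counts
{ 8, 0, 8, 0 }, the model stores the bytes 0x12, 0x34, 0xAB, 0xCD, i.e. the
high byte of each 16-bit value followed by its low byte.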