https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010
--- Comment #5 from Manolis Tsamis <manolis.tsamis at vrull dot eu> ---
Also, I further investigated the codegen difference in the second example (zip + umlal vs umull), and it looks to be some sort of RTL ordering + combine issue. Specifically, when we expand the RTL for the example, there are some very slight ordering differences where some non-dependent insns have swapped order. One of these happens to precede a relevant vector statement, and then in one case combine does the umlal transformation but in the other it does not. Afaik combine has some limits on the instruction window that it looks at, so it seems feasible that ordering differences in RTL can later turn into major codegen differences in a number of ways. Other differences seem to come from register allocation, as you mentioned. This doesn't yet provide any useful insight into whether the vectorization improvements are accidental or not.
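
To illustrate how a bounded look-ahead can make fusion depend on insn order, here is a toy model in Python. It is only a sketch of the idea described above and is not GCC's combine pass: the tuple-based insn format, the `fuse_mul_add` helper, and the `window` parameter are all hypothetical, and the real pass works on RTL with its own rules and limits.

```python
# Toy model only: NOT GCC's combine. Insns are (opcode, dest, sources) tuples,
# and `window` is a hypothetical look-ahead limit chosen for illustration.

def fuse_mul_add(insns, window=1):
    """Fuse a 'mul' with a following 'add' that consumes its result into a
    single 'mla', but only if the 'add' lies within `window` insns after the
    'mul'. Assumes the 'mul' result has exactly one use."""
    out = list(insns)
    i = 0
    while i < len(out):
        op, dest, srcs = out[i]
        if op == "mul":
            # Only inspect the next `window` insns after the multiply.
            for j in range(i + 1, min(i + 1 + window, len(out))):
                op2, dest2, srcs2 = out[j]
                if op2 == "add" and dest in srcs2:
                    acc = tuple(s for s in srcs2 if s != dest)
                    out[j] = ("mla", dest2, acc + srcs)
                    del out[i]   # the multiply is now folded into the mla
                    i -= 1       # re-examine the insn that moved into slot i
                    break
        i += 1
    return out


# Same three insns; only the position of the independent load differs.
load_first = [("load", "u", ("p",)),
              ("mul", "t", ("x", "y")),
              ("add", "r", ("acc", "t"))]
load_between = [("mul", "t", ("x", "y")),
                ("load", "u", ("p",)),
                ("add", "r", ("acc", "t"))]

fused = fuse_mul_add(load_first, window=1)      # mul and add are adjacent
unfused = fuse_mul_add(load_between, window=1)  # load pushes add out of reach
```

With `window=1`, the first ordering yields a single `mla`, while the second keeps the separate `mul` and `add`, even though the two sequences compute the same thing. Widening the window hides the difference in this toy case, which is why such effects can look accidental.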