https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65084
Bug ID: 65084
Summary: Lack of type narrowing/widening inhibits good vectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at redhat dot com

These are testcases extracted from PR 47477.

short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      short c = (char) a[i] + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}

Compiled with -O3 -mavx2 we ought to get something similar to:

short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = ((short) (a[i] << 8) >> 8) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}

though even in this case I still couldn't get the sign extension to actually be performed as a 16-bit left shift + arithmetic right shift, which I suspect would lead to even better code.

Or look at how we vectorize:

short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned char e = a[i];
      short c = e + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}

(note: here the forwprop pass already performs type promotion; instead of converting a[i] to unsigned char and back to short, it computes a[i] & 255 in short mode) and how we could instead vectorize it with type demotions:

short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = (a[i] & 0xff) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}

These are all admittedly artificial testcases, but I've seen tons of loops where multiple types were vectorized, and I think in some portion of those loops we could either use just a single type size, or at least reduce the number of conversions and distinct type sizes in the vectorized loops.