https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92772
Bug ID: 92772
Summary: wrong code vectorizing masked max
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: critical
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ams at gcc dot gnu.org
Target Milestone: ---
The testcase pr65947-10.c fails on amdgcn because there are more vector lanes
than there is data, and the algorithm created doesn't allow for this. (Actually
there's also a backend pattern missing, but I have a patch for that I'll commit
shortly.)
Here's the affected loop:
  float last = 0;
  for (int i = 0; i < 32; i++)
    if (a[i] < min_v)
      last = a[i];
This produces the following code (long lines shortened):
vect_cst__33 = {min_v_11(D), .... min_v_11(D)};
vect__4.16_32 = .MASK_LOAD (a_10(D), 4B, { -1, [...] -1, 0, [...] 0 });
vect_last_6.17_34 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33,
                                   vect__4.16_32, { 0.0, [...] 0.0 }>;
_38 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33, { 1, 2, [...] 64 },
                     { 0, [...] 0 }>;
_40 = .REDUC_MAX (_38);
_41 = {_40, _40, [...] _40};
_43 = VEC_COND_EXPR <_38 == _41, vect_last_6.17_34, { 0.0, [...] 0.0 }>;
_44 = VIEW_CONVERT_EXPR<vector(64) unsigned int>(_43);
_45 = .REDUC_MAX (_44);
_46 = VIEW_CONVERT_EXPR<float>(_45);
return _46;
In English:
1. Do a masked load of 32 elements into a 64-lane register. The spare lanes
are loaded with "0.0".
2. Compare all 64 lanes against "min_v". Label all the "true" lanes with
the lane number.
3. Use a reduction to find the greatest numbered "true" lane.
4. Zero all the loaded values apart from the one in the greatest lane.
5. Use a reduction to find the value of the lane that isn't zeroed.
That's slightly tortuous when we could just do a vec_extract on "_40", but
that's an aside.
The problem is in step 2: the spare lanes contain 0.0, so comparing them
against "min_v" returns "true" whenever "min_v" is greater than zero. In that
case the algorithm always picks the last lane, i.e. "last = a[63]", which isn't
a real value, and the result therefore always ends up being "0.0".