https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66558
--- Comment #1 from alalaw01 at gcc dot gnu.org ---
Strategy could be similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54013 except finding the last bit rather than the first (and no jump out of the loop).

That is, in the loop body:

  v_pred = (a[i] > threshold) for each element
  if (any element of v_pred set)
    v_save_pred = v_pred
    v_save_i = {i, i+1, i+2, i+3}
    v_last = v_save_i // or a different expression, as is assigned to 'last'

and in the epilogue,

  last = v_last[ rightmost set element in v_save_pred ]

where the rightmost set element could be done via narrow/trunc and 'bsr' (on x86), or more generally,

  idx = reduc_max_expr (v_save_pred ? v_save_i : 0)
  // any reduction will do here, as only one element will be non-zero:
  last = reduc_max_expr (v_save_i == idx ? v_last : 0)
  // or alternatively:
  last = v_last[ idx & (vec_num_elts - 1) ]
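
To make the strategy above concrete, here is a minimal sketch in plain C that emulates the vector lanes with small arrays rather than real SIMD. It assumes the loop being vectorised records the last index i with a[i] > threshold, which is inferred from the pseudocode and not stated in the comment. The function names, the VEC_NUM_ELTS constant, the -1 "no match" result and the scalar tail loop are all hypothetical choices made for illustration; this is not what GCC emits.

```c
#include <assert.h>

#define VEC_NUM_ELTS 4 /* assumed vector width; must be a power of two */

/* Scalar reference: index of the last element above threshold, or -1
   if there is none (the -1 sentinel is an assumption of this sketch). */
int last_above_scalar(const int *a, int n, int threshold)
{
    int last = -1;
    for (int i = 0; i < n; i++)
        if (a[i] > threshold)
            last = i;
    return last;
}

/* Emulation of the vectorised strategy: each int[VEC_NUM_ELTS] array
   stands in for one vector register. */
int last_above_vec_sketch(const int *a, int n, int threshold)
{
    int v_save_pred[VEC_NUM_ELTS] = {0};
    int v_save_i[VEC_NUM_ELTS] = {0};
    int v_last[VEC_NUM_ELTS] = {0};
    int any_saved = 0;
    int i = 0;

    assert(n >= 0);

    /* Loop body: no early exit, just overwrite the saved state whenever
       the current vector contains at least one match. */
    for (; i + VEC_NUM_ELTS <= n; i += VEC_NUM_ELTS) {
        int v_pred[VEC_NUM_ELTS];
        int any = 0;
        for (int k = 0; k < VEC_NUM_ELTS; k++) {
            v_pred[k] = a[i + k] > threshold;
            any |= v_pred[k];
        }
        if (any) {
            for (int k = 0; k < VEC_NUM_ELTS; k++) {
                v_save_pred[k] = v_pred[k];
                v_save_i[k] = i + k;
                v_last[k] = i + k; /* here 'last' is simply the index */
            }
            any_saved = 1;
        }
    }

    /* Epilogue: idx = reduc_max_expr (v_save_pred ? v_save_i : 0) */
    int last = -1;
    if (any_saved) {
        int idx = 0;
        for (int k = 0; k < VEC_NUM_ELTS; k++) {
            int cand = v_save_pred[k] ? v_save_i[k] : 0;
            if (cand > idx)
                idx = cand;
        }
        /* The lane number is idx modulo the vector width, because each
           vector starts at a multiple of VEC_NUM_ELTS. */
        last = v_last[idx & (VEC_NUM_ELTS - 1)];
    }

    /* Scalar tail for the leftover elements (an assumption; the comment
       does not discuss the remainder). */
    for (; i < n; i++)
        if (a[i] > threshold)
            last = i;

    return last;
}
```

The sketch uses the "idx & (vec_num_elts - 1)" form of the epilogue, which relies on every vector starting at an index that is a multiple of the vector width. The second reduc_max_expr form from the comment avoids that reliance at the cost of another reduction.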