[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 02 Jul 2024 00:03:22 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629


--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #6)
> (In reply to rguent...@suse.de from comment #5)
> > > In this case, the second load is conditional on the first load mask,  
> > > which
> > > means it's already done an AND.
> > > And crucially inverting it means you also inverted both conditions.
> > > 
> > > So there are some superflous masking operations happening.  But I guess 
> > > that's
> > > a separate bug.  Shall I just add some tests here and close it and open a 
> > > new
> > > PR?
> > 
> > Not sure if that helps - do we fully understand this is a separate issue and
> > not related to how we if-convert?
> > 
> 
> if-convert looks ok to me:
> 
>   <bb 3> [local count: 955630226]:
>   # i_28 = PHI <i_25(10), 0(21)>
>   _1 = (long unsigned int) i_28;
>   _2 = _1 * 4;
>   _3 = a_16(D) + _2;
>   _4 = *_3;
>   _31 = _4 != 0;
>   _55 = _54 + _2;
>   _6 = (int *) _55;
>   _56 = ~_31;
>   _7 = .MASK_LOAD (_6, 32B, _56);
>   _22 = _7 == 0;
>   _34 = _22 & _56;
>   _58 = _57 + _2;
>   _9 = (int *) _58;
>   iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);
>   _61 = _4 | _7;
>   _35 = _61 != 0;
>   _60 = _59 + _2;
>   _8 = (int *) _60;
>   iftmp.0_21 = .MASK_LOAD (_8, 32B, _35);
>   iftmp.0_12 = _34 ? iftmp.0_19 : iftmp.0_21;
>   _10 = res_23(D) + _2;
>   *_10 = iftmp.0_12;
>   i_25 = i_28 + 1;
>   if (n_15(D) > i_25)
>     goto <bb 10>; [89.00%]
>   else
>     goto <bb 11>; [11.00%]
> 
> I think what's missing here is that
> 
>   _7 = .MASK_LOAD (_6, 32B, _56);
>   _22 = _7 == 0;
>   _34 = _22 & _56;
>   iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);
> 
> in itself produces an AND. namely (_7 && _34) && _56 where _56 is the loop
> mask.
> 
> my (probably poor understanding) is that the mask tracking in the vectorizer
> attempts to prevent to keep masks and their inverse live at the same time.
> 
> but that this code in this case doesn't track masks introduced by nested
> MASK_LOADS.  at least, that's my naive interpretation.

Hmm, without carrying out the math it seems like for

  iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);
..
  iftmp.0_21 = .MASK_LOAD (_8, 32B, _35);
  iftmp.0_12 = _34 ? iftmp.0_19 : iftmp.0_21;

the _34 conditional and the _34 and _35 masks would make it so this
could be written with sth like

  iftmp.0_12 = iftmp.0_19 | iftmp.0_21;

but a detail might have escaped me ;)  Is that what you're trying to say
as well?  That of course still leaves maybe partly redundant mask
computations.  In the ifcvt dump it doesn't help that != 0 is sometimes
elided and sometimes not ...

IIRC the vectorizer at some point tries to optimize loop mask and mask
mask?  Maybe we need to teach that mechanism to track how mask masks are
combined from other masks via & and | to improve the masks used?

This could be also a separate mask optimization phase on the SLP
representation.  While it doesn't represent loop masking at least
it does for .MASK_LOAD and .MASK_STORE (and .COND_*).

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

Reply via email to