[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]

rguenth at gcc dot gnu.org Mon, 27 Aug 2018 00:46:24 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |alias
             Target|                            |x86_64-*-* i?86-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-08-27
                 CC|                            |rguenth at gcc dot gnu.org
            Version|unknown                     |8.2.1
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Petr from comment #6)
> I think the test-case can even be simplified to something like this:
> 
> #include <algorithm>
> #include <cmath>
> 
> struct Point {
>   double x, y;
> 
>   void reset(double x, double y) {
>     this->x = x;
>     this->y = y;
>   }
> };
> 
> void f1(Point* p, Point* a) {
>   p->reset(std::max(std::sqrt(p->x), a->x),
>            std::max(std::sqrt(p->y), a->y));
> }
> 
> GCC is unable to vectorize it:
> 
>   [-std=c++17 -O3 -mavx2 -fno-math-errno]
>   f1(Point*, Point*):
>     vsqrtsd xmm0, xmm0, QWORD PTR [rdi+8]
>     vmovsd  xmm1, QWORD PTR [rsi+8]
>     vsqrtsd xmm2, xmm2, QWORD PTR [rdi]
>     vmaxsd  xmm1, xmm1, xmm0
>     vmovsd  xmm0, QWORD PTR [rsi]
>     vmaxsd  xmm0, xmm0, xmm2
>     vunpcklpd       xmm0, xmm0, xmm1
>     vmovups XMMWORD PTR [rdi], xmm0
>     ret
> 
> whereas clang can:
> 
>   [-std=c++17 -O3 -mavx2 -fno-math-errno]
>   f1(Point*, Point*):
>     vsqrtpd xmm0, xmmword ptr [rdi]
>     vmovupd xmm1, xmmword ptr [rsi]
>     vmaxpd  xmm0, xmm1, xmm0
>     vmovupd xmmword ptr [rdi], xmm0
>     ret
> 
> I think this is a much simpler test-case to start with.

With -Ofast -mavx2 I get

_Z2f1P5PointS0_:
.LFB1123:
        .cfi_startproc
        vsqrtpd (%rdi), %xmm0
        vmaxpd  (%rsi), %xmm0, %xmm0
        vmovups %xmm0, (%rdi)
        ret

as said elsewhere min/max recognition is the issue here and

  return a < b ? b : a; 

isn't the same as fmax or MAX_EXPR without further constraints.


For the original testcase GCC thinks some vectorization isn't profitable
and even with -Ofast GCC manages to trick itself in a corner where
MIN/MAX_EXPR detection fails to produce straight-line code.  C++
abstraction doesn't help here - we inline the min/max functions before
post-inline passes that would have cleaned up them nicely.  In particular
phiopt is run a bit late and is confused bu jump threading we perform in VRP.
-fno-tree-vrp -Ofast -mavx2 generates straight-line non-vectorized code
which the vectorizer thinks is not profitable to vectorize.

In the end we are confused by the redundant stores (the vectorizer has
some too simplified code to handle this):

  <bb 2> [local count: 1073741825]:
  _12 = MEM[(const double &)bez_1(D) + 8];
  _13 = MEM[(const double &)bez_1(D) + 40];
  iftmp.0_75 = MAX_EXPR <_12, _13>;
  _14 = MEM[(const double &)bez_1(D)];
  _15 = MEM[(const double &)bez_1(D) + 32];
  iftmp.0_74 = MAX_EXPR <_14, _15>;
  iftmp.1_73 = MIN_EXPR <_12, _13>;
  iftmp.1_72 = MIN_EXPR <_14, _15>;
  bBox_4(D)->x0 = iftmp.1_72;
  bBox_4(D)->y0 = iftmp.1_73;
  bBox_4(D)->x1 = iftmp.0_74;
  bBox_4(D)->y1 = iftmp.0_75;

we vectorize up to here with -fno-vect-cost-model

  _6 = MEM[(double *)bez_1(D) + 16B];
  _7 = MEM[(double *)bez_1(D) + 24B];
  _50 = _7 * 2.0e+0;
  _51 = _6 * 2.0e+0;
  _10 = MEM[(double *)bez_1(D)];
  _11 = MEM[(double *)bez_1(D) + 8B];
  _8 = MEM[(double *)bez_1(D) + 32B];
  _3 = _8 + _10;
  _9 = MEM[(double *)bez_1(D) + 40B];
  _5 = _9 + _11;
  _46 = _5 - _50;
  _47 = _3 - _51;
  _44 = _11 - _7;
...

the stores to bBox_4 alias the loads from bez_1 which due to C++ abstraction
from the min/max functions are just double * loads from TBAA perspective.
Given there are no visible stores to bez_1 we cannot assume any containing
dynamic type here.

...
  iftmp.1_67 = MIN_EXPR <_23, iftmp.1_72>;
  bBox_4(D)->x0 = iftmp.1_67;
  iftmp.1_66 = MIN_EXPR <_22, iftmp.1_73>;
  bBox_4(D)->y0 = iftmp.1_66;
  iftmp.0_65 = MAX_EXPR <_23, iftmp.0_74>;
  bBox_4(D)->x1 = iftmp.0_65;
  iftmp.0_64 = MAX_EXPR <_22, iftmp.0_75>;
  bBox_4(D)->y1 = iftmp.0_64;
  return;

Lifting the BB vectorization restriction is on my todo list:

bool
vect_analyze_data_ref_accesses (vec_info *vinfo)
{
...
          /* Do not place the same access in the interleaving chain twice.  */
          if (init_b == init_prev)
            {
              gcc_assert (gimple_uid (DR_STMT (datarefs_copy[i-1]))
                          < gimple_uid (DR_STMT (drb)));
              /* ???  For now we simply "drop" the later reference which is
                 otherwise the same rather than finishing off this group.
                 In the end we'd want to re-process duplicates forming
                 multiple groups from the refs, likely by just collecting
                 all candidates (including duplicates and split points
                 below) in a vector and then process them together.  */
              continue;
            }


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]

Reply via email to