https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105
Richard Biener <rguenth at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Keywords| |alias Target| |x86_64-*-* i?86-*-* Status|UNCONFIRMED |NEW Last reconfirmed| |2018-08-27 CC| |rguenth at gcc dot gnu.org Version|unknown |8.2.1 Blocks| |53947 Ever confirmed|0 |1 --- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> --- (In reply to Petr from comment #6) > I think the test-case can even be simplified to something like this: > > #include <algorithm> > #include <cmath> > > struct Point { > double x, y; > > void reset(double x, double y) { > this->x = x; > this->y = y; > } > }; > > void f1(Point* p, Point* a) { > p->reset(std::max(std::sqrt(p->x), a->x), > std::max(std::sqrt(p->y), a->y)); > } > > GCC is unable to vectorize it: > > [-std=c++17 -O3 -mavx2 -fno-math-errno] > f1(Point*, Point*): > vsqrtsd xmm0, xmm0, QWORD PTR [rdi+8] > vmovsd xmm1, QWORD PTR [rsi+8] > vsqrtsd xmm2, xmm2, QWORD PTR [rdi] > vmaxsd xmm1, xmm1, xmm0 > vmovsd xmm0, QWORD PTR [rsi] > vmaxsd xmm0, xmm0, xmm2 > vunpcklpd xmm0, xmm0, xmm1 > vmovups XMMWORD PTR [rdi], xmm0 > ret > > whereas clang can: > > [-std=c++17 -O3 -mavx2 -fno-math-errno] > f1(Point*, Point*): > vsqrtpd xmm0, xmmword ptr [rdi] > vmovupd xmm1, xmmword ptr [rsi] > vmaxpd xmm0, xmm1, xmm0 > vmovupd xmmword ptr [rdi], xmm0 > ret > > I think this is a much simpler test-case to start with. With -Ofast -mavx2 I get _Z2f1P5PointS0_: .LFB1123: .cfi_startproc vsqrtpd (%rdi), %xmm0 vmaxpd (%rsi), %xmm0, %xmm0 vmovups %xmm0, (%rdi) ret as said elsewhere min/max recognition is the issue here and return a < b ? b : a; isn't the same as fmax or MAX_EXPR without further constraints. For the original testcase GCC thinks some vectorization isn't profitable and even with -Ofast GCC manages to trick itself in a corner where MIN/MAX_EXPR detection fails to produce straight-line code. C++ abstraction doesn't help here - we inline the min/max functions before post-inline passes that would have cleaned up them nicely. In particular phiopt is run a bit late and is confused bu jump threading we perform in VRP. -fno-tree-vrp -Ofast -mavx2 generates straight-line non-vectorized code which the vectorizer thinks is not profitable to vectorize. In the end we are confused by the redundant stores (the vectorizer has some too simplified code to handle this): <bb 2> [local count: 1073741825]: _12 = MEM[(const double &)bez_1(D) + 8]; _13 = MEM[(const double &)bez_1(D) + 40]; iftmp.0_75 = MAX_EXPR <_12, _13>; _14 = MEM[(const double &)bez_1(D)]; _15 = MEM[(const double &)bez_1(D) + 32]; iftmp.0_74 = MAX_EXPR <_14, _15>; iftmp.1_73 = MIN_EXPR <_12, _13>; iftmp.1_72 = MIN_EXPR <_14, _15>; bBox_4(D)->x0 = iftmp.1_72; bBox_4(D)->y0 = iftmp.1_73; bBox_4(D)->x1 = iftmp.0_74; bBox_4(D)->y1 = iftmp.0_75; we vectorize up to here with -fno-vect-cost-model _6 = MEM[(double *)bez_1(D) + 16B]; _7 = MEM[(double *)bez_1(D) + 24B]; _50 = _7 * 2.0e+0; _51 = _6 * 2.0e+0; _10 = MEM[(double *)bez_1(D)]; _11 = MEM[(double *)bez_1(D) + 8B]; _8 = MEM[(double *)bez_1(D) + 32B]; _3 = _8 + _10; _9 = MEM[(double *)bez_1(D) + 40B]; _5 = _9 + _11; _46 = _5 - _50; _47 = _3 - _51; _44 = _11 - _7; ... the stores to bBox_4 alias the loads from bez_1 which due to C++ abstraction from the min/max functions are just double * loads from TBAA perspective. Given there are no visible stores to bez_1 we cannot assume any containing dynamic type here. ... iftmp.1_67 = MIN_EXPR <_23, iftmp.1_72>; bBox_4(D)->x0 = iftmp.1_67; iftmp.1_66 = MIN_EXPR <_22, iftmp.1_73>; bBox_4(D)->y0 = iftmp.1_66; iftmp.0_65 = MAX_EXPR <_23, iftmp.0_74>; bBox_4(D)->x1 = iftmp.0_65; iftmp.0_64 = MAX_EXPR <_22, iftmp.0_75>; bBox_4(D)->y1 = iftmp.0_64; return; Lifting the BB vectorization restriction is on my todo list: bool vect_analyze_data_ref_accesses (vec_info *vinfo) { ... /* Do not place the same access in the interleaving chain twice. */ if (init_b == init_prev) { gcc_assert (gimple_uid (DR_STMT (datarefs_copy[i-1])) < gimple_uid (DR_STMT (drb))); /* ??? For now we simply "drop" the later reference which is otherwise the same rather than finishing off this group. In the end we'd want to re-process duplicates forming multiple groups from the refs, likely by just collecting all candidates (including duplicates and split points below) in a vector and then process them together. */ continue; } Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 [Bug 53947] [meta-bug] vectorizer missed-optimizations