https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440

--- Comment #2 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Feng Xue <f...@gcc.gnu.org>:

https://gcc.gnu.org/g:178cc419512f7e358f88dfe2336625aa99cd7438

commit r15-2096-g178cc419512f7e358f88dfe2336625aa99cd7438
Author: Feng Xue <f...@os.amperecomputing.com>
Date:   Wed May 29 17:22:36 2024 +0800

    vect: Support multiple lane-reducing operations for loop reduction [PR114440]

    For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
    the current vectorizer can only handle the pattern if the reduction chain
    does not contain any other operation, whether that other operation is
    normal or lane-reducing.

    This patch removes some constraints in reduction analysis to allow multiple
    arbitrary lane-reducing operations with mixed input vectypes in a loop
    reduction chain. For example:

       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
           sum += w[i];               // widen-sum <vector(16) char>
           sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
         }
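
    The loop above is schematic. A compilable scalar form might look like the
    sketch below. The function name reduc_chain and the element types (signed
    char inputs for dot-prod and widen-sum, short inputs for sad) are
    assumptions chosen to match the vectypes named in the comments; they are
    not taken from the patch or its tests.

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar form of the reduction chain in the example.  Name and element
   types are illustrative assumptions, not part of the patch.  */
int
reduc_chain (const signed char *d0, const signed char *d1,
             const signed char *w,
             const short *s0, const short *s1, int n)
{
  assert (n >= 0);

  int sum = 1;
  for (int i = 0; i < n; i++)
    {
      sum += d0[i] * d1[i];        /* candidate for dot-prod */
      sum += w[i];                 /* candidate for widen-sum */
      sum += abs (s0[i] - s1[i]);  /* candidate for sad */
    }
  return sum;
}
```

    All three statements feed the same accumulator, which is what makes them
    a single reduction chain.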

    The vector size is 128-bit and the vectorization factor is 16. The
    reduction statements would be transformed as:

       vector<4> int sum_v0 = { 0, 0, 0, 1 };
       vector<4> int sum_v1 = { 0, 0, 0, 0 };
       vector<4> int sum_v2 = { 0, 0, 0, 0 };
       vector<4> int sum_v3 = { 0, 0, 0, 0 };

       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy

           sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
           sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
         }

        sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
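
    The accumulator layout above can be emulated in plain C to check that the
    four partial sums add up to the scalar result. This is only a sketch: the
    function name reduc_chain_emulated, the element types, and the mapping of
    input elements to lanes inside each accumulator are assumptions (the real
    lane mapping is target-specific), and n is assumed to be a multiple of 16
    so that no epilogue loop is needed.

```c
#include <assert.h>
#include <stdlib.h>

#define VF    16  /* vectorization factor from the example */
#define NACC   4  /* accumulators sum_v0 .. sum_v3 */
#define LANES  4  /* each accumulator is a vector(4) int */

/* Emulates the transformed reduction with four int[4] accumulators.
   Hypothetical helper for illustration, not code from the patch.  */
int
reduc_chain_emulated (const signed char *d0, const signed char *d1,
                      const signed char *w,
                      const short *s0, const short *s1, int n)
{
  assert (n >= 0 && n % VF == 0);  /* sketch has no epilogue loop */

  int acc[NACC][LANES] = { { 0 } };
  acc[0][LANES - 1] = 1;           /* sum_v0 = { 0, 0, 0, 1 } */

  for (int i = 0; i < n; i += VF)
    for (int k = 0; k < VF; k++)
      {
        /* dot-prod and widen-sum take 16 chars per vector, so one
           vector covers the iteration and only sum_v0 is updated.
           The lane index k / 4 is an assumed mapping.  */
        acc[0][k / 4] += d0[i + k] * d1[i + k];
        acc[0][k / 4] += w[i + k];

        /* sad takes 8 shorts per vector, so two vectors are needed:
           elements 0..7 go to sum_v0, elements 8..15 to sum_v1.  */
        acc[k / 8][(k % 8) / 2] += abs (s0[i + k] - s1[i + k]);
      }
  /* sum_v2 and sum_v3 are never written: the "copy" statements.  */

  /* Final reduction: sum_v0 + sum_v1 + sum_v2 + sum_v3.  */
  int sum = 0;
  for (int v = 0; v < NACC; v++)
    for (int l = 0; l < LANES; l++)
      sum += acc[v][l];
  return sum;
}
```

    The sketch interleaves the three operations per element rather than per
    statement; since integer addition is associative, the total is the same
    as long as nothing overflows.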

    2024-03-22 Feng Xue <f...@os.amperecomputing.com>

    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (vectorizable_lane_reducing): New function
            declaration.
            * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
            vectorizable_lane_reducing to analyze lane-reducing operation.
            * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost
            computation code related to emulated_mixed_dot_prod.
            (vectorizable_lane_reducing): New function.
            (vectorizable_reduction): Allow multiple lane-reducing operations
            in loop reduction. Move some original lane-reducing related code to
            vectorizable_lane_reducing.
            (vect_transform_reduction): Adjust comments with updated example.

    gcc/testsuite/
            PR tree-optimization/114440
            * gcc.dg/vect/vect-reduc-chain-1.c
            * gcc.dg/vect/vect-reduc-chain-2.c
            * gcc.dg/vect/vect-reduc-chain-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
            * gcc.dg/vect/vect-reduc-dot-slp-1.c