[Bug target/117251] SHA3 code for PowerPC has a major slow down

meissner at gcc dot gnu.org via Gcc-bugs Mon, 21 Oct 2024 16:56:53 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117251


--- Comment #4 from Michael Meissner <meissner at gcc dot gnu.org> ---
I tracked down the commit that first made the slowdown visible:

commit 3a61ca1b9256535e1bfb19b2d46cde21f3908a5d (HEAD)
Author: Jan Hubicka <j...@suse.cz>
Date:   Thu Jul 6 18:56:22 2023 +0200

    Improve profile updates after loop-ch and cunroll

    Extend loop-ch and loop unrolling to fix profile in case the loop is
    known to not iterate at all (or iterate few times) while profile claims it
    iterates more.  While this is kind of a symptomatic fix, it is the best we can
    do in case the profile was originally estimated incorrectly.

    In the testcase the problematic loop is produced by the vectorizer, and I
    think the vectorizer should know, and account for in its costs, that the
    vectorizer loop and/or epilogue is not going to loop after the
    transformation.  So it would be nice to fix it on that side, too.

    The patch avoids about half of profile mismatches caused by cunroll.

    Pass dump id and name            |static mismatch|dynamic mismatch
                                     |in count       |in count
    107t cunrolli                    |      3    +3|        17251   +17251
    115t threadfull                  |      3      |        14376    -2875
    116t vrp                         |      5    +2|        30908   +16532
    117t dse                         |      5      |        30908
    118t dce                         |      3    -2|        17251   -13657
    127t ch                          |     13   +10|        17251
    131t dom                         |     39   +26|        17251
    133t isolate-paths               |     47    +8|        17251
    134t reassoc                     |     49    +2|        17251
    136t forwprop                    |     53    +4|       202501  +185250
    159t cddce                       |     61    +8|       216211   +13710
    161t ldist                       |     62    +1|       216211
    172t ifcvt                       |     66    +4|       373711  +157500
    173t vect                        |    143   +77|      9802097 +9428386
    176t cunroll                     |    221   +78|     15639591 +5837494
    183t loopdone                    |    218    -3|     15577640   -61951
    195t fre                         |    214    -4|     15577640
    197t dom                         |    213    -1|     16671606 +1093966
    199t threadfull                  |    215    +2|     16879581  +207975
    200t vrp                         |    217    +2|     17077750  +198169
    204t dce                         |    215    -2|     17004486   -73264
    206t sink                        |    213    -2|     17004486
    211t cddce                       |    219    +6|     17005926    +1440
    255t optimized                   |    217    -2|     17005926
    256r expand                      |    210    -7|     19571573 +2565647
    258r into_cfglayout              |    208    -2|     19571573
    275r loop2_unroll                |    212    +4|     22992432 +3420859
    291r ce2                         |    210    -2|     23011838
    312r pro_and_epilogue            |    230   +20|     23073776   +61938
    315r jump2                       |    236    +6|     27110534 +4036758
    323r bbro                        |    229    -7|     21826835 -5283699

    W/o the patch cunroll does:

    176t cunroll                     |    294  +151|126548439   +116746342

    and we end up with 291 mismatches at bbro.

    Bootstrapped/regtested x86_64-linux. Plan to commit it after the
    scale_loop_frequency patch.

    gcc/ChangeLog:

            PR middle-end/25623
            * tree-ssa-loop-ch.cc (ch_base::copy_headers): Scale loop frequency
            to maximal number of iterations determined.
            * tree-ssa-loop-ivcanon.cc (try_unroll_loop_completely): Likewise.

    gcc/testsuite/ChangeLog:

            PR middle-end/25623
            * gfortran.dg/pr25623-2.f90: New test.
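
To make the two mismatch columns in the table above more concrete, here is a minimal sketch. It assumes that a "mismatch" is a basic block whose count differs from the sum of the counts on its incoming edges, that the static column counts such blocks, and that the dynamic column sums the size of the differences. The commit message does not spell out these definitions, and the function name and data layout are hypothetical, not GCC code.

```python
def count_mismatches(block_counts, edges):
    """Return (static, dynamic) mismatch totals for a toy CFG.

    block_counts: dict mapping block name -> profile count (assumed layout)
    edges: list of (src, dst, count) tuples (assumed layout)
    """
    # Sum the incoming edge counts for every destination block.
    incoming = {}
    for _src, dst, count in edges:
        incoming[dst] = incoming.get(dst, 0) + count

    static = 0   # number of blocks whose count disagrees with its in-edges
    dynamic = 0  # total size of those disagreements
    for block, count in block_counts.items():
        if block not in incoming:
            continue  # entry block: nothing to compare against
        diff = abs(count - incoming[block])
        if diff:
            static += 1
            dynamic += diff
    return static, dynamic


# Toy loop: the header claims 1000 executions but its in-edges only sum to 900.
blocks = {"entry": 100, "header": 1000, "exit": 100}
cfg_edges = [("entry", "header", 100),
             ("header", "header", 800),
             ("header", "exit", 100)]
print(count_mismatches(blocks, cfg_edges))
```

Under these assumptions a pass that leaves one inconsistent block adds 1 to the static total and the size of the inconsistency to the dynamic total.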

However, I backed that particular patch out of the current trunk sources, and
the result still shows similar regressions.

Here is the scale loop patch that was mentioned above; it is the adjacent
commit.  At present, I have not tried backing out this patch:

commit d4c2e34deef8cbd81ba2ef3389fdbaf95c70e225
Author: Jan Hubicka <j...@suse.cz>
Date:   Thu Jul 6 18:51:02 2023 +0200

    Improve scale_loop_profile

    The original scale_loop_profile was implemented to handle only the very
    simple loops produced by the vectorizer at that time (basically loops with
    only one exit and no subloops).  It also has not been updated to the new
    profile-count API very carefully.

    The function does two things:
     1) scales down the loop profile by a given probability.
        This is useful, for example, to scale down the profile after peeling,
        when the loop body is executed less often than before
     2) updates the profile to cap the iteration count by the ITERATION_BOUND
        parameter.

    I changed ITERATION_BOUND to be the actual bound on the number of iterations
    as used elsewhere (i.e. number of executions of the latch edge) rather than
    number of iterations + 1 as it was before.

    To do 2) one needs to do the following:
      a) scale down the loop profile so the frequency of the header is at most
         the sum of in-edge counts * (iteration_bound + 1)
      b) update loop exit probabilities so their count is the same
         as before scaling.
      c) reduce frequencies of basic blocks after the loop exit

    The old code did b) by setting the probability to 1 / iteration_bound, which
    is correct only if the basic block containing the exit executes precisely
    once per iteration (it is not inside another conditional or inner loop).
    This is fixed now by using set_edge_probability_and_rescale_others.

    Also, c) was implemented only for the special case when the exit was just
    before the latch basic block.  I now use dominance info to get some of the
    additional cases right.

    I still did not try to do anything for multiple exit loops, though the
    implementation could be generalized.

    Bootstrapped/regtested x86_64-linux.  Plan to commit it tonight if there
    are no complaints.

    gcc/ChangeLog:

            * cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
            probability update to be safe on loops with subloops.
            Make bound parameter to be iteration bound.
            * tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
            of scale_loop_profile.
            * tree-vect-loop-manip.cc (vect_do_peeling): Likewise.
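
Steps a) and b) of the commit message above can be illustrated with a small arithmetic sketch. This is not the GCC implementation: the function names are hypothetical, counts are plain integers rather than GCC's profile_count type, and the loop is assumed to have a single exit.

```python
from fractions import Fraction


def cap_scale(header_count, entry_count, iteration_bound):
    """Step a): factor that scales the loop so the header count is at most
    entry_count * (iteration_bound + 1).  iteration_bound is the number of
    latch-edge executions, as in the commit message."""
    max_header = entry_count * (iteration_bound + 1)
    if header_count <= max_header:
        return Fraction(1)  # the profile already respects the bound
    return Fraction(max_header, header_count)


def exit_probability(exit_edge_count, exit_block_count, scale):
    """Step b): probability that keeps the exit edge count unchanged after
    the block containing the exit has been scaled down."""
    scaled = exit_block_count * scale
    if scaled == 0:
        return Fraction(1)
    return min(Fraction(1), Fraction(exit_edge_count) / scaled)


# The loop is entered 10 times, the profile claims 1000 header executions,
# but the loop is known to execute its latch at most 3 times per entry.
scale = cap_scale(1000, 10, 3)
print(scale)                               # 1/25
# Exit test runs once per iteration: block count equals the header count.
print(exit_probability(10, 1000, scale))   # 1/4
# Exit test sits inside a conditional taken half the time.
print(exit_probability(10, 500, scale))    # 1/2
```

The last two calls show why a fixed probability derived only from the iteration bound is right in the first case but wrong in the second, which is the situation the commit message says the old code got wrong.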
