https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91512

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, for me module_configure.fppized.f90 is much more problematic, taking the
longest to compile and using the most memory.  IIRC that one has a long series
of initialization expressions.  And

 load CSE after reload              : 143.49 ( 41%)   0.02 (  3%) 143.52 ( 41%)   1001 kB (  0%)

(known issue I think)

Then there's module_alloc_space_0.fppized.f90 with a similar

 load CSE after reload              :  55.97 ( 61%)   0.00 (  0%)  55.97 ( 60%)    341 kB (  0%)

and more of these... :/

And module_domain.fppized.f90 with

 machine dep reorg                  :  89.07 ( 95%)   0.02 ( 18%)  89.10 ( 95%)     54 kB (  0%)

that's probably STV (scalar-to-vector) ... same for module_dm.fppized.f90

module_first_rk_step_part1.fppized.f90 also compiles slowly, with

 callgraph ipa passes               :  21.30 ( 14%)   0.13 (  9%)  21.44 ( 14%)  95303 kB ( 11%)
 alias stmt walking                 :  17.93 ( 12%)   0.12 (  8%)  18.16 ( 12%)    136 kB (  0%)
 tree FRE                           :  14.19 (  9%)   0.03 (  2%)  14.32 (  9%)   2744 kB (  0%)
 complete unrolling                 :   6.07 (  4%)   0.02 (  1%)   6.08 (  4%)  95401 kB ( 11%)
 load CSE after reload              :  33.62 ( 22%)   0.01 (  1%)  33.63 ( 22%)    174 kB (  0%)

and solve_em.fppized.f90 might be similar.

Looking at the .original dump of module_first_rk_step_part1.fppized.f90, it is
the decomposed "grid" that gets passed along, causing all the re-packs.
So the caller has

  SUBROUTINE first_rk_step_part1 (   grid , ...
    TYPE ( domain ), INTENT(INOUT) :: grid
...
        CALL phy_prep ( config_flags,                                    &
                        grid%mut, grid%muu, grid%muv, grid%u_2,          &
                        grid%v_2, grid%p, grid%pb, grid%alt,             &
                        grid%ph_2, grid%phb, grid%t_2, grid%tsk, moist, num_moist,   &
                        grid%rho,th_phy, p_phy, pi_phy, grid%u_phy, grid%v_phy,      &
                        p8w, t_phy, t8w, grid%z, grid%z_at_w, dz8w,      &
                        grid%p_hyd, grid%p_hyd_w, grid%dnw,              &
                        grid%fnm, grid%fnp, grid%znw, grid%p_top,        &
                        grid%rthraten,                                   &
                        grid%rthblten, grid%rublten, grid%rvblten,       &
                        grid%rqvblten, grid%rqcblten, grid%rqiblten,     &
                        grid%rucuten,  grid%rvcuten,  grid%rthcuten,     &
                        grid%rqvcuten, grid%rqccuten, grid%rqrcuten,     &
                        grid%rqicuten, grid%rqscuten,                    &
                        grid%rushten,  grid%rvshten,  grid%rthshten,     &
                        grid%rqvshten, grid%rqcshten, grid%rqrshten,     &
                        grid%rqishten, grid%rqsshten, grid%rqgshten,     &
                        grid%rthften,  grid%rqvften,                     &
                        grid%RUNDGDTEN, grid%RVNDGDTEN, grid%RTHNDGDTEN, &
                        grid%RPHNDGDTEN,grid%RQVNDGDTEN, grid%RMUNDGDTEN,&
!jdf
                        grid%landmask,grid%xland,                 &
!jdf
                        ids, ide, jds, jde, kds, kde,                    &
                        ims, ime, jms, jme, kms, kme,                    &
                        grid%i_start(ij), grid%i_end(ij),                &
                        grid%j_start(ij), grid%j_end(ij),                &
                        k_start, k_end                                   )

and more of that, while TYPE (domain) has

real      ,DIMENSION(:,:,:)   ,POINTER   :: rucuten
real      ,DIMENSION(:,:)     ,POINTER   :: mut
...

so these are the assumed-shape arrays.  Note the packing is done
conditionally, like
    contiguous.11171 = (D.83839.dim[0].stride == 1
                        && D.83839.dim[1].stride
                           == D.83839.dim[0].stride * ((D.83839.dim[0].ubound - D.83839.dim[0].lbound) + 1))
                       && D.83839.dim[2].stride
                          == D.83839.dim[1].stride * ((D.83839.dim[1].ubound - D.83839.dim[1].lbound) + 1);
    if (__builtin_expect ((integer(kind=8)) contiguous.11171, 1, 50))
      { 
        arg_ptr.11170 = (real(kind=4)[0:] * restrict) grid->u_phy.data;
      }
    else
      { 
        D.83779 = (real(kind=4)[0:] *) grid->u_phy.data;
... repack ...
      }

so this simply exposes quite a number of loop nests in this file where
previously there were no loops, only calls (the repack calls plus the actual
calls).

Given that calls might be inlined, it seems worth expanding the repacking
inline.  IIRC the original motivation for adding the inline expansion
was exactly such a case, correct?

So a testcase for the "regression" would be a function with a single
call statement with a _lot_ of arguments, all in need of repacking.
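Something along these lines (a sketch only; all names and the choice of rank-2
reals are mine, not taken from WRF, and the real testcase would want many more
arguments):

```fortran
! Hypothetical reduced testcase: pointer components passed to
! explicit-shape dummies, so each actual argument gets a contiguity
! check plus inline repack code at the call site.
module m
  type t
     real, dimension(:,:), pointer :: a1, a2, a3   ! ... many more
  end type t
contains
  subroutine callee (n, a1, a2, a3)                ! ... many more args
    integer :: n
    real, dimension(n,n) :: a1, a2, a3             ! explicit-shape dummies
  end subroutine callee
  subroutine caller (x, n)
    type(t), intent(inout) :: x
    integer :: n
    call callee (n, x%a1, x%a2, x%a3)              ! each may need repacking
  end subroutine caller
end module m
```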
