On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao <qing.z...@oracle.com> 
wrote:
>
>
>> On Jan 15, 2021, at 2:11 AM, Richard Biener <rguent...@suse.de> wrote:
>> 
>> 
>> 
>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>> 
>>> Hi, 
>>> More data on code size and compilation time with CPU2017:
>>> ******** Compilation time data: the numbers are the slowdown against the
>>> default “no”:
>>> benchmarks          A/no     D/no
>>>
>>> 500.perlbench_r    5.19%    1.95%
>>> 502.gcc_r          0.46%   -0.23%
>>> 505.mcf_r          0.00%    0.00%
>>> 520.omnetpp_r      0.85%    0.00%
>>> 523.xalancbmk_r    0.79%   -0.40%
>>> 525.x264_r        -4.48%    0.00%
>>> 531.deepsjeng_r   16.67%   16.67%
>>> 541.leela_r        0.00%    0.00%
>>> 557.xz_r           0.00%    0.00%
>>>
>>> 507.cactuBSSN_r    1.16%    0.58%
>>> 508.namd_r         9.62%    8.65%
>>> 510.parest_r       0.48%    1.19%
>>> 511.povray_r       3.70%    3.70%
>>> 519.lbm_r          0.00%    0.00%
>>> 521.wrf_r          0.05%    0.02%
>>> 526.blender_r      0.33%    1.32%
>>> 527.cam4_r        -0.93%   -0.93%
>>> 538.imagick_r      1.32%    3.95%
>>> 544.nab_r          0.00%    0.00%
>>> From the above data, it looks like the compilation time impact
>>> of implementations A and D is almost the same.
>>> ******** Code size data: the numbers are the code size increase against the
>>> default “no”:
>>> benchmarks          A/no     D/no
>>>
>>> 500.perlbench_r    2.84%    0.34%
>>> 502.gcc_r          2.59%    0.35%
>>> 505.mcf_r          3.55%    0.39%
>>> 520.omnetpp_r      0.54%    0.03%
>>> 523.xalancbmk_r    0.36%    0.39%
>>> 525.x264_r         1.39%    0.13%
>>> 531.deepsjeng_r    2.15%   -1.12%
>>> 541.leela_r        0.50%   -0.20%
>>> 557.xz_r           0.31%    0.13%
>>>
>>> 507.cactuBSSN_r    5.00%   -0.01%
>>> 508.namd_r         3.64%   -0.07%
>>> 510.parest_r       1.12%    0.33%
>>> 511.povray_r       4.18%    1.16%
>>> 519.lbm_r          8.83%    6.44%
>>> 521.wrf_r          0.08%    0.02%
>>> 526.blender_r      1.63%    0.45%
>>> 527.cam4_r         0.16%    0.06%
>>> 538.imagick_r      3.18%   -0.80%
>>> 544.nab_r          5.76%   -1.11%
>>> Avg                2.52%    0.36%
>>> From the above data, implementation D is almost always better than A
>>> (523.xalancbmk_r is the one exception), which is surprising to me;
>>> I am not sure what the reason for this is.
>> 
>> D probably inhibits most interesting loop transforms (check SPEC FP
>> performance).
>
>The call to .DEFERRED_INIT is marked as ECF_CONST:
>
>/* A function to represent an artificial initialization to an uninitialized
>   automatic variable. The first argument is the variable itself, the
>   second argument is the initialization type.  */
>DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>
>So I assume that such a const call should minimize the impact on loop
>optimizations. But yes, it will still inhibit some of the loop
>transformations.
>
>>  It will also most definitely disallow SRA which, when
>> an aggregate is not completely elided, tends to grow code.
>
>Makes sense to me. 
>
>The run-time performance data for D and A are actually very similar, as
>I posted in the previous email (I list them here again for convenience).
>
>Run-time performance overhead with A and D:
>
>benchmarks          A/no     D/no
>
>500.perlbench_r    1.25%    1.25%
>502.gcc_r          0.68%    1.80%
>505.mcf_r          0.68%    0.14%
>520.omnetpp_r      4.83%    4.68%
>523.xalancbmk_r    0.18%    1.96%
>525.x264_r         1.55%    2.07%
>531.deepsjeng_    11.57%   11.85%
>541.leela_r        0.64%    0.80%
>557.xz_           -0.41%   -0.41%
>
>507.cactuBSSN_r    0.44%    0.44%
>508.namd_r         0.34%    0.34%
>510.parest_r       0.17%    0.25%
>511.povray_r      56.57%   57.27%
>519.lbm_r          0.00%    0.00%
>521.wrf_r         -0.28%   -0.37%
>526.blender_r     16.96%   17.71%
>527.cam4_r         0.70%    0.53%
>538.imagick_r      2.40%    2.40%
>544.nab_r          0.00%   -0.65%
>
>avg                5.17%    5.37%
>
>Especially for the SPEC FP benchmarks, I didn’t see too much
>performance difference between A and D. 
>I guess that the RTL optimizations might be enough to get rid of most
>of the overhead introduced by the additional initialization. 
>
>> 
>>> ******** Stack usage data: I added -fstack-usage to the compilation line
>>> when compiling the CPU2017 benchmarks, and *.su files were generated for
>>> each of the modules.
>>> Since there are a lot of such files, and the stack size information is
>>> embedded in each of them, I just picked one benchmark, 511.povray, to
>>> check. It is the one that has the most runtime overhead when adding
>>> initialization (both A and D).
>>> I identified all the *.su files that differ between A and D and did a
>>> diff on those *.su files. It looks like the stack size is much higher
>>> with D than with A, for example:
>>> $ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
>>> 5c5
>>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>>> ---
>>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>>> $ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
>>> 9c9
>>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
>>> ---
>>> > image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272 static
>>> ….
>>> It looks like implementation D has more stack size impact than A. 
>>> Do you have any insight into the reason for this?
>> 
>> D will keep all initialized aggregates as aggregates and live which
>> means stack will be allocated for it.  With A the usual optimizations
>> to reduce stack usage can be applied.
>
>I checked the routine “pov::bump_map” in 511.povray_r, since its stack
>usage increases a lot with implementation D, by examining the IR
>immediately before the RTL expansion phase (image.cpp.244t.optimized).
>I found that we have the following additional statements for the
>array elements:
>
>void  pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double * normal)
>{
>…
>  double p3[3];
>  double p2[3];
>  double p1[3];
>  float colour3[5];
>  float colour2[5];
>  float colour1[5];
>…
>   # DEBUG BEGIN_STMT
>  colour1 = .DEFERRED_INIT (colour1, 2);
>  colour2 = .DEFERRED_INIT (colour2, 2);
>  colour3 = .DEFERRED_INIT (colour3, 2);
>  # DEBUG BEGIN_STMT
>  MEM <double> [(double[3] *)&p1] = p1$0_144(D);
>  MEM <double> [(double[3] *)&p1 + 8B] = p1$1_135(D);
>  MEM <double> [(double[3] *)&p1 + 16B] = p1$2_138(D);
>  p1 = .DEFERRED_INIT (p1, 2);
>  # DEBUG D#12 => MEM <double> [(double[3] *)&p1]
>  # DEBUG p1$0 => D#12
>  # DEBUG D#11 => MEM <double> [(double[3] *)&p1 + 8B]
>  # DEBUG p1$1 => D#11
>  # DEBUG D#10 => MEM <double> [(double[3] *)&p1 + 16B]
>  # DEBUG p1$2 => D#10
>  MEM <double> [(double[3] *)&p2] = p2$0_109(D);
>  MEM <double> [(double[3] *)&p2 + 8B] = p2$1_111(D);
>  MEM <double> [(double[3] *)&p2 + 16B] = p2$2_254(D);
>  p2 = .DEFERRED_INIT (p2, 2);
>  # DEBUG D#9 => MEM <double> [(double[3] *)&p2]
>  # DEBUG p2$0 => D#9
>  # DEBUG D#8 => MEM <double> [(double[3] *)&p2 + 8B]
>  # DEBUG p2$1 => D#8
>  # DEBUG D#7 => MEM <double> [(double[3] *)&p2 + 16B]
>  # DEBUG p2$2 => D#7
>  MEM <double> [(double[3] *)&p3] = p3$0_256(D);
>  MEM <double> [(double[3] *)&p3 + 8B] = p3$1_258(D);
>  MEM <double> [(double[3] *)&p3 + 16B] = p3$2_260(D);
>  p3 = .DEFERRED_INIT (p3, 2);
>  ….
>}
>
>I guess that the above “MEM <double>….. = …” statements are the ones
>that make the difference. Which phase introduced them?

Looks like SRA. But you can just dump all and grep for the first occurrence. 
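
The "dump all and grep" step can be sketched roughly as follows. This is only a minimal illustration: it assumes the file was compiled with -fdump-tree-all so that tree dumps named like image.cpp.244t.optimized sit in one directory, and that the three-digit pass number in the name gives the pass order. The helper name and the regular expression are made up for this example and are not part of GCC.

```python
import re
from pathlib import Path

# Hypothetical pattern (not part of GCC): matches tree dump names such as
# "image.cpp.244t.optimized" and captures the three-digit pass number.
DUMP_NAME = re.compile(r"\.(\d{3})t\.[^.]+$")


def first_dump_containing(dump_dir, source_name, needle):
    """Return the earliest tree dump of source_name whose text contains needle.

    Dumps are ordered by the pass number embedded in the file name.
    Returns None when no dump contains the text.
    """
    dumps = []
    for path in Path(dump_dir).glob(source_name + ".*"):
        match = DUMP_NAME.search(path.name)
        if match:
            dumps.append((int(match.group(1)), path))
    for _number, path in sorted(dumps):
        if needle in path.read_text(errors="replace"):
            return path
    return None
```

For the case above, a call such as first_dump_containing(".", "image.cpp", "MEM <double> [(double[3] *)&p1]") would name the first dump in which those stores show up; the pass that wrote that dump is the one to look at.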


>> 
>>> Let me know if you have any comments and suggestions.
>> 
>> First of all I would check whether the prototype implementations
>> work as expected.
>I have already done such checks with small test cases, checking the
>IR generated with implementation A or D, mainly focusing on
>*.c.006t.gimple and *.c.*t.expand; all worked as expected. 
>
>For CPU2017, as in the example above, I also checked the IR for
>both A and D, and it looks like all worked as expected.
>
>Thanks. 
>
>Qing
>> 
>> Richard.
>> 
>> 
>>> thanks.
>>> Qing
>>>      On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de>
>>>      wrote:
>>> 
>>>      On Tue, 12 Jan 2021, Qing Zhao wrote:
>>> 
>>>            Hi, 
>>> 
>>>            Just checking in to see whether you have any comments
>>>            and suggestions on this:
>>> 
>>>            FYI, I have been continuing with the Approach D
>>>            implementation since last week:
>>> 
>>>            D. Adding calls to .DEFERRED_INIT during
>>>            gimplification, expanding the .DEFERRED_INIT during
>>>            expand to
>>>            real initialization. Adjusting the uninitialized pass
>>>            with the new refs with “.DEFERRED_INIT”.
>>> 
>>>            For the remaining work of Approach D:
>>> 
>>>            ** complete the implementation of
>>>            -ftrivial-auto-var-init=pattern;
>>>            ** complete the implementation of uninitialized
>>>            warnings maintenance work for D. 
>>> 
>>>            I have completed the uninitialized warnings
>>>            maintenance work for D,
>>>            and finished part of the
>>>            -ftrivial-auto-var-init=pattern implementation. 
>>> 
>>>            The following is the remaining work for Approach D:
>>> 
>>>              ** -ftrivial-auto-var-init=pattern for VLA;
>>>              ** add a new variable attribute,
>>>            __attribute__((uninitialized)):
>>>            the marked variable is intentionally left uninitialized
>>>            for performance purposes;
>>>              ** add complete test cases.
>>> 
>>>            Please let me know if you have any objection to my
>>>            current decision to implement approach D. 
>>> 
>>>      Did you do any analysis on how stack usage and code size are
>>>      changed 
>>>      with approach D?  How does compile-time behave (we could gobble
>>>      up
>>>      lots of .DEFERRED_INIT calls I guess)?
>>> 
>>>      Richard.
>>> 
>>>            Thanks a lot for your help.
>>> 
>>>            Qing
>>> 
>>>                  On Jan 5, 2021, at 1:05 PM, Qing Zhao
>>>                  via Gcc-patches
>>>                  <gcc-patches@gcc.gnu.org> wrote:
>>> 
>>>                  Hi,
>>> 
>>>                  This is an update for our previous
>>>                  discussion. 
>>> 
>>>                  1. I implemented the following two
>>>                  different implementations in the latest
>>>                  upstream gcc:
>>> 
>>>                  A. Adding real initialization during
>>>                  gimplification, not maintaining the
>>>                  uninitialized warnings.
>>> 
>>>                  D. Adding calls to .DEFERRED_INIT
>>>                  during gimplification, expanding the
>>>                  .DEFERRED_INIT during expand to
>>>                  real initialization. Adjusting the
>>>                  uninitialized pass with the new refs
>>>                  with “.DEFERRED_INIT”.
>>> 
>>>                  Note, in this initial implementation,
>>>                  ** I ONLY implemented
>>>                  -ftrivial-auto-var-init=zero; the
>>>                  implementation of
>>>                  -ftrivial-auto-var-init=pattern 
>>>                     is not done yet.  Therefore, the
>>>                  performance data is only about
>>>                  -ftrivial-auto-var-init=zero. 
>>> 
>>>                  ** I added a temporary option
>>>                  -fauto-var-init-approach=A|B|C|D to
>>>                  choose implementation A or D for 
>>>                     the runtime performance study.
>>>                  ** I haven’t finished the uninitialized
>>>                  warnings maintenance work for D. (That
>>>                  might take more time than I expected). 
>>> 
>>>                  2. I collected runtime data for CPU2017
>>>                  on an x86 machine with this new gcc for
>>>                  the following 3 cases:
>>> 
>>>                  no: default. (-g -O2 -march=native )
>>>                  A:  default +
>>>                   -ftrivial-auto-var-init=zero
>>>                  -fauto-var-init-approach=A 
>>>                  D:  default +
>>>                   -ftrivial-auto-var-init=zero
>>>                  -fauto-var-init-approach=D 
>>> 
>>>                  And then computed the slowdown data for
>>>                  both A and D as follows:
>>> 
>>>                  benchmarks A / no D /no
>>> 
>>>                  500.perlbench_r 1.25% 1.25%
>>>                  502.gcc_r 0.68% 1.80%
>>>                  505.mcf_r 0.68% 0.14%
>>>                  520.omnetpp_r 4.83% 4.68%
>>>                  523.xalancbmk_r 0.18% 1.96%
>>>                  525.x264_r 1.55% 2.07%
>>>                  531.deepsjeng_ 11.57% 11.85%
>>>                  541.leela_r 0.64% 0.80%
>>>                  557.xz_  -0.41% -0.41%
>>> 
>>>                  507.cactuBSSN_r 0.44% 0.44%
>>>                  508.namd_r 0.34% 0.34%
>>>                  510.parest_r 0.17% 0.25%
>>>                  511.povray_r 56.57% 57.27%
>>>                  519.lbm_r 0.00% 0.00%
>>>                  521.wrf_r  -0.28% -0.37%
>>>                  526.blender_r 16.96% 17.71%
>>>                  527.cam4_r 0.70% 0.53%
>>>                  538.imagick_r 2.40% 2.40%
>>>                  544.nab_r 0.00% -0.65%
>>> 
>>>                  avg 5.17% 5.37%
>>> 
>>>                  From the above data, we can see that, in
>>>                  general, the runtime performance
>>>                  slowdown of 
>>>                  implementations A and D is similar for
>>>                  individual benchmarks.
>>> 
>>>                  Several benchmarks have a
>>>                  significant slowdown with the newly added
>>>                  initialization for both
>>>                  A and D, for example, 511.povray_r,
>>>                  526.blender_, and 531.deepsjeng_r. I
>>>                  will try to study a little bit
>>>                  more what kind of new initializations
>>>                  introduced such a slowdown. 
>>> 
>>>                  From the study so far, I think
>>>                  that approach D should be good enough
>>>                  for our final implementation. 
>>>                  So I will try to finish approach D with
>>>                  the following remaining work:
>>> 
>>>                      ** complete the implementation of
>>>                  -ftrivial-auto-var-init=pattern;
>>>                      ** complete the implementation of
>>>                  uninitialized warnings maintenance work
>>>                  for D. 
>>> 
>>>                  Let me know if you have any comments and
>>>                  suggestions on my current and future
>>>                  work.
>>> 
>>>                  Thanks a lot for your help.
>>> 
>>>                  Qing
>>> 
>>>                        On Dec 9, 2020, at 10:18 AM,
>>>                        Qing Zhao via Gcc-patches
>>>                        <gcc-patches@gcc.gnu.org>
>>>                        wrote:
>>> 
>>>                        The following are the
>>>                        approaches I will implement
>>>                        and compare:
>>> 
>>>                        Our final goal is to keep
>>>                        the uninitialized warning
>>>                        and minimize the run-time
>>>                        performance cost.
>>> 
>>>                        A. Adding real
>>>                        initialization during
>>>                        gimplification, not maintaining
>>>                        the uninitialized warnings.
>>>                        B. Adding real
>>>                        initialization during
>>>                        gimplification, marking them
>>>                        with “artificial_init”. 
>>>                          Adjusting the uninitialized
>>>                        pass, maintaining the
>>>                        annotation, making sure the
>>>                        real init is not
>>>                          deleted from the fake
>>>                        init. 
>>>                        C.  Marking the DECL for an
>>>                        uninitialized auto variable
>>>                        as “no_explicit_init” during
>>>                        gimplification,
>>>                           maintain this
>>>                        “no_explicit_init” bit till
>>>                        after
>>>                        pass_late_warn_uninitialized,
>>>                        or till pass_expand, 
>>>                           add real initialization
>>>                        for all DECLs that are
>>>                        marked with
>>>                        “no_explicit_init”.
>>>                        D. Adding .DEFERRED_INIT
>>>                        during gimplification,
>>>                        expanding the .DEFERRED_INIT
>>>                        during expand to
>>>                          real initialization.
>>>                        Adjusting the uninitialized pass
>>>                        with the new refs with
>>>                        “.DEFERRED_INIT”.
>>> 
>>>                        Of the above, approach A
>>>                        will be the one that has
>>>                        the minimum run-time cost, and
>>>                        will be the baseline for the
>>>                        performance
>>>                        comparison. 
>>> 
>>>                        I will implement approach D
>>>                        next; this one is expected
>>>                        to have the most run-time
>>>                        overhead in the above
>>>                        list, but its
>>>                        implementation should be the
>>>                        cleanest among B, C, D.
>>>                        Let’s see how much more
>>>                        performance overhead this
>>>                        approach
>>>                        will have. If the data is
>>>                        good, maybe we can avoid the
>>>                        effort of implementing B and
>>>                        C. 
>>> 
>>>                        If the performance of D is
>>>                        not good, I will implement B
>>>                        or C at that time.
>>> 
>>>                        Let me know if you have any
>>>                        comments or suggestions.
>>> 
>>>                        Thanks.
>>> 
>>>                        Qing
>>> 
>>>      -- 
>>>      Richard Biener <rguent...@suse.de>
>>>      SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>>>      Nuernberg,
>>>      Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
