> On Jan 15, 2021, at 2:11 AM, Richard Biener <rguent...@suse.de> wrote:
>
>
>
> On Thu, 14 Jan 2021, Qing Zhao wrote:
>
>> Hi,
>> More data on code size and compilation time with CPU2017:
>> ********Compilation time data: the numbers are the slowdown against the
>> default “no”:
>> benchmarks         A/no     D/no
>>
>> 500.perlbench_r    5.19%    1.95%
>> 502.gcc_r          0.46%   -0.23%
>> 505.mcf_r          0.00%    0.00%
>> 520.omnetpp_r      0.85%    0.00%
>> 523.xalancbmk_r    0.79%   -0.40%
>> 525.x264_r        -4.48%    0.00%
>> 531.deepsjeng_r   16.67%   16.67%
>> 541.leela_r        0.00%    0.00%
>> 557.xz_r           0.00%    0.00%
>>
>> 507.cactuBSSN_r    1.16%    0.58%
>> 508.namd_r         9.62%    8.65%
>> 510.parest_r       0.48%    1.19%
>> 511.povray_r       3.70%    3.70%
>> 519.lbm_r          0.00%    0.00%
>> 521.wrf_r          0.05%    0.02%
>> 526.blender_r      0.33%    1.32%
>> 527.cam4_r        -0.93%   -0.93%
>> 538.imagick_r      1.32%    3.95%
>> 544.nab_r          0.00%    0.00%
>> From the above data, it looks like the compilation-time impact of
>> implementations A and D is almost the same.
>> *******Code size data: the numbers are the code size increase against the
>> default “no”:
>> benchmarks         A/no     D/no
>>
>> 500.perlbench_r    2.84%    0.34%
>> 502.gcc_r          2.59%    0.35%
>> 505.mcf_r          3.55%    0.39%
>> 520.omnetpp_r      0.54%    0.03%
>> 523.xalancbmk_r    0.36%    0.39%
>> 525.x264_r         1.39%    0.13%
>> 531.deepsjeng_r    2.15%   -1.12%
>> 541.leela_r        0.50%   -0.20%
>> 557.xz_r           0.31%    0.13%
>>
>> 507.cactuBSSN_r    5.00%   -0.01%
>> 508.namd_r         3.64%   -0.07%
>> 510.parest_r       1.12%    0.33%
>> 511.povray_r       4.18%    1.16%
>> 519.lbm_r          8.83%    6.44%
>> 521.wrf_r          0.08%    0.02%
>> 526.blender_r      1.63%    0.45%
>> 527.cam4_r         0.16%    0.06%
>> 538.imagick_r      3.18%   -0.80%
>> 544.nab_r          5.76%   -1.11%
>> Avg                2.52%    0.36%
>> From the above data, implementation D is almost always better than A on code
>> size, which is surprising to me; I am not sure what the reason for this is.
>
> D probably inhibits most interesting loop transforms (check SPEC FP
> performance).
The call to .DEFERRED_INIT is marked as ECF_CONST:

/* A function to represent an artificial initialization to an uninitialized
   automatic variable.  The first argument is the variable itself, the
   second argument is the initialization type.  */
DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)

So I assume that such a const call should minimize the impact on loop
optimizations. But yes, it will still inhibit some of the loop transformations.
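For reference, here is a minimal sketch (a hypothetical test case; the GIMPLE
is shown schematically in comments) of what implementation D emits for an
uninitialized scalar:

/* Hypothetical example: an uninitialized local compiled with
   -ftrivial-auto-var-init=zero under implementation D.  */
extern void use (int);

void
foo (void)
{
  int x;  /* no explicit initializer */
  /* Gimplification emits, schematically:
       x = .DEFERRED_INIT (x, 2);
     Since the call is ECF_CONST | ECF_LEAF | ECF_NOTHROW, the
     optimizers see no side effects; the call is only turned into a
     real zero store at RTL expansion time.  */
  use (x);
}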
> It will also most definitely disallow SRA which, when
> an aggregate is not completely elided, tends to grow code.
Makes sense to me.
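As a concrete (hypothetical) illustration of the SRA point, consider a small
local array that SRA can normally scalarize away completely:

/* Hypothetical sketch, not from the benchmarks: without
   -ftrivial-auto-var-init, SRA can replace p[0..2] with scalars and
   elide the array entirely.  With implementation D, gimplification
   emits
     p = .DEFERRED_INIT (p, 2);
   which references the whole aggregate, so (as noted above) p stays
   live as an aggregate and keeps its stack slot.  */
double
sum3 (double a, double b, double c)
{
  double p[3];
  p[0] = a;
  p[1] = b;
  p[2] = c;
  return p[0] + p[1] + p[2];
}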
The run-time performance data for D and A are actually very similar, as I
posted in the previous email (listed here again for convenience).
Run-time performance overhead with A and D:
benchmarks         A/no     D/no

500.perlbench_r    1.25%    1.25%
502.gcc_r          0.68%    1.80%
505.mcf_r          0.68%    0.14%
520.omnetpp_r      4.83%    4.68%
523.xalancbmk_r    0.18%    1.96%
525.x264_r         1.55%    2.07%
531.deepsjeng_r   11.57%   11.85%
541.leela_r        0.64%    0.80%
557.xz_r          -0.41%   -0.41%

507.cactuBSSN_r    0.44%    0.44%
508.namd_r         0.34%    0.34%
510.parest_r       0.17%    0.25%
511.povray_r      56.57%   57.27%
519.lbm_r          0.00%    0.00%
521.wrf_r         -0.28%   -0.37%
526.blender_r     16.96%   17.71%
527.cam4_r         0.70%    0.53%
538.imagick_r      2.40%    2.40%
544.nab_r          0.00%   -0.65%

avg                5.17%    5.37%
Especially for the SPEC FP benchmarks, I didn’t see much performance
difference between A and D.
I guess that the RTL optimizations might be enough to get rid of most of the
overhead introduced by the additional initialization.
>
>> ********Stack usage data: I added -fstack-usage to the compilation line when
>> compiling the CPU2017 benchmarks, so the *.su files were generated for each
>> of the modules.
>> Since there are a lot of such files, with the stack size information embedded
>> in each of them, I just picked one benchmark, 511.povray, to check; it is
>> the one that has the most runtime overhead when adding initialization (both
>> A and D).
>> I identified all the *.su files that differ between A and D and diffed
>> them; it looks like the stack size is much higher with D than with A, for
>> example:
>> $ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
>> 5c5
>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>> ---
>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>> $ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
>> 9c9
>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
>> ---
>> > image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272 static
>> ….
>> It looks like implementation D has a bigger stack size impact than A.
>> Do you have any insight into the reason for this?
>
> D will keep all initialized aggregates as aggregates and live which
> means stack will be allocated for it. With A the usual optimizations
> to reduce stack usage can be applied.
I checked the routine “pov::bump_map” in 511.povray_r, since it has a large
stack increase due to implementation D, by examining the IR immediately before
the RTL expansion phase (image.cpp.244t.optimized). I found the following
additional statements for the array elements:
void pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double * normal)
{
…
double p3[3];
double p2[3];
double p1[3];
float colour3[5];
float colour2[5];
float colour1[5];
…
# DEBUG BEGIN_STMT
colour1 = .DEFERRED_INIT (colour1, 2);
colour2 = .DEFERRED_INIT (colour2, 2);
colour3 = .DEFERRED_INIT (colour3, 2);
# DEBUG BEGIN_STMT
MEM <double> [(double[3] *)&p1] = p1$0_144(D);
MEM <double> [(double[3] *)&p1 + 8B] = p1$1_135(D);
MEM <double> [(double[3] *)&p1 + 16B] = p1$2_138(D);
p1 = .DEFERRED_INIT (p1, 2);
# DEBUG D#12 => MEM <double> [(double[3] *)&p1]
# DEBUG p1$0 => D#12
# DEBUG D#11 => MEM <double> [(double[3] *)&p1 + 8B]
# DEBUG p1$1 => D#11
# DEBUG D#10 => MEM <double> [(double[3] *)&p1 + 16B]
# DEBUG p1$2 => D#10
MEM <double> [(double[3] *)&p2] = p2$0_109(D);
MEM <double> [(double[3] *)&p2 + 8B] = p2$1_111(D);
MEM <double> [(double[3] *)&p2 + 16B] = p2$2_254(D);
p2 = .DEFERRED_INIT (p2, 2);
# DEBUG D#9 => MEM <double> [(double[3] *)&p2]
# DEBUG p2$0 => D#9
# DEBUG D#8 => MEM <double> [(double[3] *)&p2 + 8B]
# DEBUG p2$1 => D#8
# DEBUG D#7 => MEM <double> [(double[3] *)&p2 + 16B]
# DEBUG p2$2 => D#7
MEM <double> [(double[3] *)&p3] = p3$0_256(D);
MEM <double> [(double[3] *)&p3 + 8B] = p3$1_258(D);
MEM <double> [(double[3] *)&p3 + 16B] = p3$2_260(D);
p3 = .DEFERRED_INIT (p3, 2);
….
}
I guess that the above “MEM <double> … = …” stores are the ones that make the
difference. Which phase introduced them?
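(For reference, a reduced test case like the following — a hypothetical
reproducer modelled on bump_map, not the actual povray source — together with
the per-pass -fdump-tree-all dumps for A versus D, should show which pass
first introduces the extra stores.)

/* Hypothetical reduced reproducer: small local arrays whose elements
   are otherwise fully scalarizable.  Comparing the per-pass dumps
   from -fdump-tree-all with -fauto-var-init-approach=A versus =D
   should show where the extra "MEM <double> [...] = ..." stores
   first appear.  */
extern void consume (const double *, const double *, const double *);

void
bump_map_like (double x, double y, double z)
{
  double p1[3], p2[3], p3[3];

  p1[0] = x; p1[1] = y; p1[2] = z;
  p2[0] = y; p2[1] = z; p2[2] = x;
  p3[0] = z; p3[1] = x; p3[2] = y;

  consume (p1, p2, p3);
}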
>
>> Let me know if you have any comments and suggestions.
>
> First of all I would check whether the prototype implementations
> work as expected.
I have already done such checks with small test cases, examining the IR
generated by implementations A and D, focusing mainly on *.c.006t.gimple
and *.c.*t.expand; all worked as expected.
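A minimal test case along those lines might look like this (a hypothetical
sketch, not one of my actual tests):

/* Minimal sanity check (hypothetical): compile with
   -ftrivial-auto-var-init=zero -fdump-tree-gimple and verify that
   the *.c.006t.gimple dump contains
     x = .DEFERRED_INIT (x, 2);
   and that the expand dump replaces it with a real zero
   initialization, while -Wuninitialized still warns about the
   uninitialized read of x.  */
int
t (void)
{
  int x;
  return x;  /* deliberately uninitialized read */
}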
For CPU2017, for example as above, I also checked the IR for both A and
D, and everything looked as expected.
Thanks.
Qing
>
> Richard.
>
>
>> thanks.
>> Qing
>> On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de>
>> wrote:
>>
>> On Tue, 12 Jan 2021, Qing Zhao wrote:
>>
>> Hi,
>>
>> Just check in to see whether you have any comments
>> and suggestions on this:
>>
>> FYI, I have been continuing with the
>> Approach D implementation since last week:
>>
>> D. Adding calls to .DEFERRED_INIT during
>> gimplification, expanding the .DEFERRED_INIT
>> during expand into
>> real initialization, and adjusting the
>> uninitialized pass to handle the new refs
>> to “.DEFERRED_INIT”.
>>
>> For the remaining work of Approach D:
>>
>> ** complete the implementation of
>> -ftrivial-auto-var-init=pattern;
>> ** complete the implementation of uninitialized
>> warnings maintenance work for D.
>>
>> I have completed the uninitialized warnings
>> maintenance work for D,
>> and finished part of the
>> -ftrivial-auto-var-init=pattern implementation.
>>
>> The following is the remaining work for Approach D:
>>
>> ** -ftrivial-auto-var-init=pattern for VLA;
>> ** add a new attribute for variables:
>> __attribute__((uninitialized));
>> a marked variable is intentionally left
>> uninitialized, for performance reasons.
>> ** add complete test cases;
>>
>> Please let me know if you have any objection to my
>> current decision to implement approach D.
>>
>> Did you do any analysis on how stack usage and code size are
>> changed
>> with approach D? How does compile-time behave (we could gobble
>> up
>> lots of .DEFERRED_INIT calls I guess)?
>>
>> Richard.
>>
>> Thanks a lot for your help.
>>
>> Qing
>>
>> On Jan 5, 2021, at 1:05 PM, Qing Zhao
>> via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> This is an update for our previous
>> discussion.
>>
>> 1. I implemented the following two
>> different implementations in the latest
>> upstream gcc:
>>
>> A. Adding real initialization during
>> gimplification, not maintaining the
>> uninitialized warnings.
>>
>> D. Adding calls to .DEFERRED_INIT
>> during gimplification, expanding the
>> .DEFERRED_INIT during expand into
>> real initialization, and adjusting
>> the uninitialized pass to handle the
>> new refs to “.DEFERRED_INIT”.
>>
>> Note, in this initial implementation,
>> ** I ONLY implement
>> -ftrivial-auto-var-init=zero, the
>> implementation of
>> -ftrivial-auto-var-init=pattern
>> is not done yet. Therefore, the
>> performance data is only about
>> -ftrivial-auto-var-init=zero.
>>
>> ** I added a temporary option
>> -fauto-var-init-approach=A|B|C|D to
>> choose implementation A or D for
>> runtime performance study.
>> ** I didn’t finish the uninitialized
>> warnings maintenance work for D. (That
>> might take more time than I expected).
>>
>> 2. I collected runtime data for CPU2017
>> on an x86 machine with this new gcc for
>> the following 3 cases:
>>
>> no: default. (-g -O2 -march=native )
>> A: default +
>> -ftrivial-auto-var-init=zero
>> -fauto-var-init-approach=A
>> D: default +
>> -ftrivial-auto-var-init=zero
>> -fauto-var-init-approach=D
>>
>> And then computed the slowdown data for
>> both A and D as follows:
>>
>> benchmarks A/no D/no
>>
>> 500.perlbench_r 1.25% 1.25%
>> 502.gcc_r 0.68% 1.80%
>> 505.mcf_r 0.68% 0.14%
>> 520.omnetpp_r 4.83% 4.68%
>> 523.xalancbmk_r 0.18% 1.96%
>> 525.x264_r 1.55% 2.07%
>> 531.deepsjeng_r 11.57% 11.85%
>> 541.leela_r 0.64% 0.80%
>> 557.xz_r -0.41% -0.41%
>>
>> 507.cactuBSSN_r 0.44% 0.44%
>> 508.namd_r 0.34% 0.34%
>> 510.parest_r 0.17% 0.25%
>> 511.povray_r 56.57% 57.27%
>> 519.lbm_r 0.00% 0.00%
>> 521.wrf_r -0.28% -0.37%
>> 526.blender_r 16.96% 17.71%
>> 527.cam4_r 0.70% 0.53%
>> 538.imagick_r 2.40% 2.40%
>> 544.nab_r 0.00% -0.65%
>>
>> avg 5.17% 5.37%
>>
>> From the above data, we can see that in
>> general, the runtime performance
>> slowdowns for
>> implementations A and D are similar for
>> individual benchmarks.
>>
>> There are several benchmarks that show
>> significant slowdown with the newly added
>> initialization for both
>> A and D, for example, 511.povray_r,
>> 526.blender_r, and 531.deepsjeng_r; I
>> will try to study a little bit
>> more which new initializations
>> introduced such slowdown.
>>
>> From the current study so far, I think
>> that approach D should be good enough
>> for our final implementation.
>> So, I will try to finish approach D with
>> the following remaining work:
>>
>> ** complete the implementation of
>> -ftrivial-auto-var-init=pattern;
>> ** complete the implementation of
>> uninitialized warnings maintenance work
>> for D.
>>
>> Let me know if you have any comments and
>> suggestions on my current and future
>> work.
>>
>> Thanks a lot for your help.
>>
>> Qing
>>
>> On Dec 9, 2020, at 10:18 AM,
>> Qing Zhao via Gcc-patches
>> <gcc-patches@gcc.gnu.org>
>> wrote:
>>
>> The following are the
>> approaches I will implement
>> and compare:
>>
>> Our final goal is to keep
>> the uninitialized warning
>> and minimize the run-time
>> performance cost.
>>
>> A. Adding real
>> initialization during
>> gimplification, not maintain
>> the uninitialized warnings.
>> B. Adding real
>> initialization during
>> gimplification, marking it
>> with “artificial_init”,
>> adjusting the uninitialized
>> pass, maintaining the
>> annotation, and making sure
>> the real init is not
>> deleted as a fake init.
>> C. Marking the DECL for an
>> uninitialized auto variable
>> as “no_explicit_init” during
>> gimplification,
>> maintaining this
>> “no_explicit_init” bit till
>> after
>> pass_late_warn_uninitialized,
>> or till pass_expand, and
>> adding real initialization
>> for all DECLs that are
>> marked with
>> “no_explicit_init”.
>> D. Adding .DEFERRED_INIT
>> during gimplification,
>> expanding the .DEFERRED_INIT
>> during expand into
>> real initialization, and
>> adjusting the uninitialized
>> pass to handle the new refs
>> to “.DEFERRED_INIT”.
>>
>> In the above, approach A
>> will be the one that has
>> the minimum run-time cost,
>> and will be the baseline for
>> the performance
>> comparison.
>>
>> So I will implement approach D;
>> this one is expected
>> to have the most run-time
>> overhead among the above
>> list, but its
>> implementation should be the
>> cleanest among B, C, and D.
>> Let’s see how much more
>> performance overhead this
>> approach
>> adds. If the data is
>> good, maybe we can avoid the
>> effort of implementing B
>> and C.
>>
>> If the performance of D is
>> not good, I will implement B
>> or C at that time.
>>
>> Let me know if you have any
>> comment or suggestions.
>>
>> Thanks.
>>
>> Qing
>>
>> --
>> Richard Biener <rguent...@suse.de>
>> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>> Nuernberg,
>> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)