> On Jan 15, 2021, at 11:22 AM, Richard Biener <rguent...@suse.de> wrote:
>
> On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao <qing.z...@oracle.com> wrote:
>>
>>
>>> On Jan 15, 2021, at 2:11 AM, Richard Biener <rguent...@suse.de>
>> wrote:
>>>
>>>
>>>
>>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>>>
>>>> Hi,
>>>> More data on code size and compilation time with CPU2017:
>>>> ******** Compilation time data: the numbers are the slowdown
>>>> against the default “no”:
>>>> benchmarks A/no D/no
>>>>
>>>> 500.perlbench_r 5.19% 1.95%
>>>> 502.gcc_r 0.46% -0.23%
>>>> 505.mcf_r 0.00% 0.00%
>>>> 520.omnetpp_r 0.85% 0.00%
>>>> 523.xalancbmk_r 0.79% -0.40%
>>>> 525.x264_r -4.48% 0.00%
>>>> 531.deepsjeng_r 16.67% 16.67%
>>>> 541.leela_r 0.00% 0.00%
>>>> 557.xz_r 0.00% 0.00%
>>>>
>>>> 507.cactuBSSN_r 1.16% 0.58%
>>>> 508.namd_r 9.62% 8.65%
>>>> 510.parest_r 0.48% 1.19%
>>>> 511.povray_r 3.70% 3.70%
>>>> 519.lbm_r 0.00% 0.00%
>>>> 521.wrf_r 0.05% 0.02%
>>>> 526.blender_r 0.33% 1.32%
>>>> 527.cam4_r -0.93% -0.93%
>>>> 538.imagick_r 1.32% 3.95%
>>>> 544.nab_r 0.00% 0.00%
>>>> From the above data, it looks like the compilation-time impact of
>>>> implementations A and D is almost the same.
>>>> ******** Code size data: the numbers are the code size increase
>>>> against the default “no”:
>>>> benchmarks A/no D/no
>>>>
>>>> 500.perlbench_r 2.84% 0.34%
>>>> 502.gcc_r 2.59% 0.35%
>>>> 505.mcf_r 3.55% 0.39%
>>>> 520.omnetpp_r 0.54% 0.03%
>>>> 523.xalancbmk_r 0.36% 0.39%
>>>> 525.x264_r 1.39% 0.13%
>>>> 531.deepsjeng_r 2.15% -1.12%
>>>> 541.leela_r 0.50% -0.20%
>>>> 557.xz_r 0.31% 0.13%
>>>>
>>>> 507.cactuBSSN_r 5.00% -0.01%
>>>> 508.namd_r 3.64% -0.07%
>>>> 510.parest_r 1.12% 0.33%
>>>> 511.povray_r 4.18% 1.16%
>>>> 519.lbm_r 8.83% 6.44%
>>>> 521.wrf_r 0.08% 0.02%
>>>> 526.blender_r 1.63% 0.45%
>>>> 527.cam4_r 0.16% 0.06%
>>>> 538.imagick_r 3.18% -0.80%
>>>> 544.nab_r 5.76% -1.11%
>>>> Avg 2.52% 0.36%
>>>> From the above data, implementation D is almost always better than
>>>> A for code size; that is surprising to me, and I am not sure of the
>>>> reason for it.
>>>
>>> D probably inhibits most interesting loop transforms (check SPEC FP
>>> performance).
>>
>> The call to .DEFERRED_INIT is marked as ECF_CONST:
>>
>> /* A function to represent an artificial initialization to an
>>    uninitialized automatic variable.  The first argument is the
>>    variable itself, the second argument is the initialization type.  */
>> DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW,
>>                  NULL)
>>
>> So, I assume such a const call should minimize the impact on loop
>> optimizations. But yes, it will still inhibit some loop
>> transformations.
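>>
>> For reference, a minimal toy case of my own (an illustration, not
>> from the patch) of the guarantee the option provides, compiled with
>> -ftrivial-auto-var-init=zero:
>>
>> int leak (void)
>> {
>>   int secret;     /* no explicit initializer */
>>   return secret;  /* gimplified to
>>                      secret = .DEFERRED_INIT (secret, 2);
>>                      (2 appears to be the zero-init kind, judging by
>>                      the dumps in this thread); expand then emits a
>>                      real zero store, so this reliably returns 0.  */
>> }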
>>
>>> It will also most definitely disallow SRA which, when
>>> an aggregate is not completely elided, tends to grow code.
>>
>> Makes sense to me.
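>>
>> To make the SRA point concrete, here is a toy case of my own (again
>> an illustration, not taken from the benchmarks):
>>
>> double norm3 (double a, double b, double c)
>> {
>>   /* Normally SRA can typically replace p by three scalars, and no
>>      stack slot is needed.  With p = .DEFERRED_INIT (p, 2) in the IL,
>>      the whole aggregate is used by the call, so p must stay in
>>      memory.  */
>>   double p[3];
>>   p[0] = a; p[1] = b; p[2] = c;
>>   return p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
>> }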
>>
>> The run-time performance data for D and A are actually very similar,
>> as I posted in the previous email (listed here again for convenience):
>>
>> Run-time performance overhead with A and D:
>>
>> benchmarks A/no D/no
>>
>> 500.perlbench_r 1.25% 1.25%
>> 502.gcc_r 0.68% 1.80%
>> 505.mcf_r 0.68% 0.14%
>> 520.omnetpp_r 4.83% 4.68%
>> 523.xalancbmk_r 0.18% 1.96%
>> 525.x264_r 1.55% 2.07%
>> 531.deepsjeng_r 11.57% 11.85%
>> 541.leela_r 0.64% 0.80%
>> 557.xz_r -0.41% -0.41%
>>
>> 507.cactuBSSN_r 0.44% 0.44%
>> 508.namd_r 0.34% 0.34%
>> 510.parest_r 0.17% 0.25%
>> 511.povray_r 56.57% 57.27%
>> 519.lbm_r 0.00% 0.00%
>> 521.wrf_r -0.28% -0.37%
>> 526.blender_r 16.96% 17.71%
>> 527.cam4_r 0.70% 0.53%
>> 538.imagick_r 2.40% 2.40%
>> 544.nab_r 0.00% -0.65%
>>
>> avg 5.17% 5.37%
>>
>> Especially for the SPEC FP benchmarks, I did not see much performance
>> difference between A and D.
>> I guess the RTL optimizations might be enough to get rid of most of
>> the overhead introduced by the additional initialization.
>>
>>>
>>>> ******** Stack usage data: I added -fstack-usage to the compilation
>>>> line when compiling the CPU2017 benchmarks, so a *.su file with the
>>>> stack size information was generated for each module. Since there
>>>> are a lot of such files, I picked one benchmark, 511.povray_r, to
>>>> check; it is the one with the most runtime overhead when adding
>>>> initialization (for both A and D).
>>>> I identified all the *.su files that differ between A and D and
>>>> diffed them; it looks like the stack size is much higher with D
>>>> than with A, for example:
>>>> $ diff build_base_auto_init.D.0000/bbox.su
>>>> build_base_auto_init.A.0000/bbox.su
>>>> 5c5
>>>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>>>>   pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>>>> ---
>>>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>>>>   pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>>>> $ diff build_base_auto_init.D.0000/image.su
>>>> build_base_auto_init.A.0000/image.su
>>>> 9c9
>>>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*,
>>>>   double*) 624 static
>>>> ---
>>>> > image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*,
>>>>   double*) 272 static
>>>> ….
>>>> It looks like implementation D has more stack size impact than A.
>>>> Do you have any insight into the reason for this?
>>>
>>> D will keep all initialized aggregates as aggregates and live, which
>>> means stack will be allocated for them. With A the usual optimizations
>>> to reduce stack usage can be applied.
>>
>> I checked the routine pov::bump_map in 511.povray_r, since it has a
>> large stack increase due to implementation D, by examining the IR
>> immediately before the RTL expansion phase (image.cpp.244t.optimized).
>> I found the following additional statements for the array elements:
>>
>> void pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double
>> * normal)
>> {
>> …
>> double p3[3];
>> double p2[3];
>> double p1[3];
>> float colour3[5];
>> float colour2[5];
>> float colour1[5];
>> …
>> # DEBUG BEGIN_STMT
>> colour1 = .DEFERRED_INIT (colour1, 2);
>> colour2 = .DEFERRED_INIT (colour2, 2);
>> colour3 = .DEFERRED_INIT (colour3, 2);
>> # DEBUG BEGIN_STMT
>> MEM <double> [(double[3] *)&p1] = p1$0_144(D);
>> MEM <double> [(double[3] *)&p1 + 8B] = p1$1_135(D);
>> MEM <double> [(double[3] *)&p1 + 16B] = p1$2_138(D);
>> p1 = .DEFERRED_INIT (p1, 2);
>> # DEBUG D#12 => MEM <double> [(double[3] *)&p1]
>> # DEBUG p1$0 => D#12
>> # DEBUG D#11 => MEM <double> [(double[3] *)&p1 + 8B]
>> # DEBUG p1$1 => D#11
>> # DEBUG D#10 => MEM <double> [(double[3] *)&p1 + 16B]
>> # DEBUG p1$2 => D#10
>> MEM <double> [(double[3] *)&p2] = p2$0_109(D);
>> MEM <double> [(double[3] *)&p2 + 8B] = p2$1_111(D);
>> MEM <double> [(double[3] *)&p2 + 16B] = p2$2_254(D);
>> p2 = .DEFERRED_INIT (p2, 2);
>> # DEBUG D#9 => MEM <double> [(double[3] *)&p2]
>> # DEBUG p2$0 => D#9
>> # DEBUG D#8 => MEM <double> [(double[3] *)&p2 + 8B]
>> # DEBUG p2$1 => D#8
>> # DEBUG D#7 => MEM <double> [(double[3] *)&p2 + 16B]
>> # DEBUG p2$2 => D#7
>> MEM <double> [(double[3] *)&p3] = p3$0_256(D);
>> MEM <double> [(double[3] *)&p3 + 8B] = p3$1_258(D);
>> MEM <double> [(double[3] *)&p3 + 16B] = p3$2_260(D);
>> p3 = .DEFERRED_INIT (p3, 2);
>> ….
>> }
>>
>> I guess the above “MEM <double> … = …” stores are the ones that make
>> the difference. Which phase introduced them?
>
> Looks like SRA. But you can just dump all and grep for the first occurrence.
Yes, it looks like SRA is the one; grepping all the per-pass dumps for
the first occurrence of these stores shows:
image.cpp.035t.esra: MEM <double> [(double[3] *)&p1] = p1$0_195(D);
image.cpp.035t.esra: MEM <double> [(double[3] *)&p1 + 8B] = p1$1_182(D);
image.cpp.035t.esra: MEM <double> [(double[3] *)&p1 + 16B] = p1$2_185(D);

My reading, hedged: SRA creates scalar replacements p1$0..p1$2, but since
the whole of p1 is still used as the .DEFERRED_INIT operand, the
replacements are written back into the aggregate, which keeps the stack
slot for p1 live.

Qing
>
>
>>>
>>>> Let me know if you have any comments and suggestions.
>>>
>>> First of all I would check whether the prototype implementations
>>> work as expected.
>> I have already done such checks with small test cases, examining the
>> IR generated with implementations A and D, focusing mainly on
>> *.c.006t.gimple and *.c.*t.expand; all worked as expected.
>>
>> For CPU2017, as above, I also checked the IR for both A and D, and
>> all looked as expected.
>>
>> Thanks.
>>
>> Qing
>>>
>>> Richard.
>>>
>>>
>>>> thanks.
>>>> Qing
>>>> On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de>
>>>> wrote:
>>>>
>>>> On Tue, 12 Jan 2021, Qing Zhao wrote:
>>>>
>>>> Hi,
>>>>
>>>> Just checking in to see whether you have any comments
>>>> or suggestions on this:
>>>>
>>>> FYI, I have been continuing with the
>>>> Approach D implementation since last week:
>>>>
>>>> D. Add calls to .DEFERRED_INIT during
>>>> gimplification; expand the .DEFERRED_INIT
>>>> calls to real initialization during expand;
>>>> adjust the uninitialized pass to handle the
>>>> new refs to “.DEFERRED_INIT”.
>>>>
>>>> For the remaining work of Approach D:
>>>>
>>>> ** complete the implementation of
>>>> -ftrivial-auto-var-init=pattern;
>>>> ** complete the implementation of uninitialized
>>>> warnings maintenance work for D.
>>>>
>>>> I have completed the uninitialized warnings
>>>> maintenance work for D, and finished part of
>>>> the -ftrivial-auto-var-init=pattern
>>>> implementation.
>>>>
>>>> The following work remains for Approach D:
>>>>
>>>> ** -ftrivial-auto-var-init=pattern for VLAs;
>>>> ** add a new variable attribute,
>>>> __attribute__((uninitialized)), to mark a
>>>> variable as intentionally uninitialized for
>>>> performance reasons (see the sketch below);
>>>> ** add complete test cases;
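>>>>
>>>> A hedged usage sketch of the planned attribute
>>>> (the final spelling may differ; fill_buffer is
>>>> a made-up helper):
>>>>
>>>>   void fill_buffer (char *buf, unsigned long len);
>>>>
>>>>   void f (void)
>>>>   {
>>>>     /* Always fully overwritten before any read,
>>>>        so ask the compiler to skip the automatic
>>>>        initialization of this buffer.  */
>>>>     char buf[4096] __attribute__((uninitialized));
>>>>     fill_buffer (buf, sizeof buf);
>>>>   }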
>>>>
>>>> Please let me know if you have any objection to my
>>>> current decision to implement Approach D.
>>>>
>>>> Did you do any analysis on how stack usage and code size are
>>>> changed
>>>> with approach D? How does compile-time behave (we could gobble
>>>> up
>>>> lots of .DEFERRED_INIT calls I guess)?
>>>>
>>>> Richard.
>>>>
>>>> Thanks a lot for your help.
>>>>
>>>> Qing
>>>>
>>>> On Jan 5, 2021, at 1:05 PM, Qing Zhao
>>>> via Gcc-patches
>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This is an update for our previous
>>>> discussion.
>>>>
>>>> 1. I implemented the following two
>>>> different approaches in the latest
>>>> upstream gcc:
>>>>
>>>> A. Add real initialization during
>>>> gimplification; do not maintain the
>>>> uninitialized warnings.
>>>>
>>>> D. Add calls to .DEFERRED_INIT during
>>>> gimplification; expand the
>>>> .DEFERRED_INIT calls to real
>>>> initialization during expand; adjust
>>>> the uninitialized pass to handle the
>>>> new refs to “.DEFERRED_INIT”.
>>>>
>>>> Note, in this initial implementation:
>>>> ** I ONLY implemented
>>>> -ftrivial-auto-var-init=zero; the
>>>> implementation of
>>>> -ftrivial-auto-var-init=pattern
>>>> is not done yet. Therefore, the
>>>> performance data below is only for
>>>> -ftrivial-auto-var-init=zero.
>>>> ** I added a temporary option
>>>> -fauto-var-init-approach=A|B|C|D to
>>>> choose implementation A or D for the
>>>> runtime performance study.
>>>> ** I did not finish the uninitialized
>>>> warnings maintenance work for D. (That
>>>> might take more time than I expected.)
>>>>
>>>> 2. I collected runtime data for CPU2017
>>>> on an x86 machine with this new gcc for
>>>> the following 3 cases:
>>>>
>>>> no: default (-g -O2 -march=native)
>>>> A: default +
>>>> -ftrivial-auto-var-init=zero
>>>> -fauto-var-init-approach=A
>>>> D: default +
>>>> -ftrivial-auto-var-init=zero
>>>> -fauto-var-init-approach=D
>>>>
>>>> I then computed the slowdown for both A
>>>> and D relative to the default, as follows:
>>>>
>>>> benchmarks A/no D/no
>>>>
>>>> 500.perlbench_r 1.25% 1.25%
>>>> 502.gcc_r 0.68% 1.80%
>>>> 505.mcf_r 0.68% 0.14%
>>>> 520.omnetpp_r 4.83% 4.68%
>>>> 523.xalancbmk_r 0.18% 1.96%
>>>> 525.x264_r 1.55% 2.07%
>>>> 531.deepsjeng_r 11.57% 11.85%
>>>> 541.leela_r 0.64% 0.80%
>>>> 557.xz_r -0.41% -0.41%
>>>>
>>>> 507.cactuBSSN_r 0.44% 0.44%
>>>> 508.namd_r 0.34% 0.34%
>>>> 510.parest_r 0.17% 0.25%
>>>> 511.povray_r 56.57% 57.27%
>>>> 519.lbm_r 0.00% 0.00%
>>>> 521.wrf_r -0.28% -0.37%
>>>> 526.blender_r 16.96% 17.71%
>>>> 527.cam4_r 0.70% 0.53%
>>>> 538.imagick_r 2.40% 2.40%
>>>> 544.nab_r 0.00% -0.65%
>>>>
>>>> avg 5.17% 5.37%
>>>>
>>>> From the above data, we can see that, in
>>>> general, the runtime slowdowns of
>>>> implementations A and D are similar for
>>>> individual benchmarks.
>>>>
>>>> Several benchmarks show a significant
>>>> slowdown with the newly added
>>>> initialization for both A and D, for
>>>> example 511.povray_r, 526.blender_r, and
>>>> 531.deepsjeng_r. I will try to study
>>>> further what kind of new initializations
>>>> introduced such slowdowns.
>>>>
>>>> From the study so far, I think approach D
>>>> should be good enough for our final
>>>> implementation, so I will try to finish it
>>>> with the following remaining work:
>>>>
>>>> ** complete the implementation of
>>>> -ftrivial-auto-var-init=pattern;
>>>> ** complete the implementation of
>>>> uninitialized warnings maintenance work
>>>> for D.
>>>>
>>>> Let me know if you have any comments and
>>>> suggestions on my current and future
>>>> work.
>>>>
>>>> Thanks a lot for your help.
>>>>
>>>> Qing
>>>>
>>>> On Dec 9, 2020, at 10:18 AM,
>>>> Qing Zhao via Gcc-patches
>>>> <gcc-patches@gcc.gnu.org>
>>>> wrote:
>>>>
>>>> The following are the approaches I will
>>>> implement and compare. Our final goal is
>>>> to keep the uninitialized warnings and
>>>> minimize the run-time performance cost.
>>>>
>>>> A. Add real initialization during
>>>> gimplification; do not maintain the
>>>> uninitialized warnings.
>>>> B. Add real initialization during
>>>> gimplification, marking it with
>>>> “artificial_init”. Adjust the
>>>> uninitialized pass to maintain the
>>>> annotation, making sure the real init
>>>> is not deleted and stays distinguished
>>>> from the fake init.
>>>> C. Mark the DECL for an uninitialized
>>>> auto variable as “no_explicit_init”
>>>> during gimplification, maintain this
>>>> “no_explicit_init” bit till after
>>>> pass_late_warn_uninitialized, or till
>>>> pass_expand, then add real
>>>> initialization for all DECLs that are
>>>> marked with “no_explicit_init”.
>>>> D. Add .DEFERRED_INIT calls during
>>>> gimplification; expand the
>>>> .DEFERRED_INIT calls to real
>>>> initialization during expand; adjust
>>>> the uninitialized pass to handle the
>>>> new refs to “.DEFERRED_INIT”.
>>>> Of the above, approach A should have the
>>>> minimum run-time cost and will be the base
>>>> for the performance comparison.
>>>>
>>>> I will then implement approach D; it is
>>>> expected to have the most run-time
>>>> overhead of the above, but its
>>>> implementation should be the cleanest
>>>> among B, C, and D. Let's see how much
>>>> performance overhead this approach adds.
>>>> If the data is good, maybe we can avoid
>>>> the effort of implementing B and C.
>>>>
>>>> If the performance of D is
>>>> not good, I will implement B
>>>> or C at that time.
>>>>
>>>> Let me know if you have any
>>>> comment or suggestions.
>>>>
>>>> Thanks.
>>>>
>>>> Qing
>>>>
>>>> --
>>>> Richard Biener <rguent...@suse.de>
>>>> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>>>> Nuernberg,
>>>> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)