Hi,

More data on code size and compilation time with CPU2017:
******** Compilation time data: the numbers are the slowdown against the default "no":

benchmarks          A/no      D/no
500.perlbench_r     5.19%     1.95%
502.gcc_r           0.46%    -0.23%
505.mcf_r           0.00%     0.00%
520.omnetpp_r       0.85%     0.00%
523.xalancbmk_r     0.79%    -0.40%
525.x264_r         -4.48%     0.00%
531.deepsjeng_r    16.67%    16.67%
541.leela_r         0.00%     0.00%
557.xz_r            0.00%     0.00%
507.cactuBSSN_r     1.16%     0.58%
508.namd_r          9.62%     8.65%
510.parest_r        0.48%     1.19%
511.povray_r        3.70%     3.70%
519.lbm_r           0.00%     0.00%
521.wrf_r           0.05%     0.02%
526.blender_r       0.33%     1.32%
527.cam4_r         -0.93%    -0.93%
538.imagick_r       1.32%     3.95%
544.nab_r           0.00%     0.00%

From the above data, it looks like the compilation-time impact of implementations A and D is almost the same.

******** Code size data: the numbers are the code size increase against the default "no":

benchmarks          A/no      D/no
500.perlbench_r     2.84%     0.34%
502.gcc_r           2.59%     0.35%
505.mcf_r           3.55%     0.39%
520.omnetpp_r       0.54%     0.03%
523.xalancbmk_r     0.36%     0.39%
525.x264_r          1.39%     0.13%
531.deepsjeng_r     2.15%    -1.12%
541.leela_r         0.50%    -0.20%
557.xz_r            0.31%     0.13%
507.cactuBSSN_r     5.00%    -0.01%
508.namd_r          3.64%    -0.07%
510.parest_r        1.12%     0.33%
511.povray_r        4.18%     1.16%
519.lbm_r           8.83%     6.44%
521.wrf_r           0.08%     0.02%
526.blender_r       1.63%     0.45%
527.cam4_r          0.16%     0.06%
538.imagick_r       3.18%    -0.80%
544.nab_r           5.76%    -1.11%

Avg                 2.52%     0.36%

From the above data, implementation D is consistently better than A on code size. This is surprising to me; I am not sure what the reason is.

******** Stack usage data: I added -fstack-usage to the compilation line when compiling the CPU2017 benchmarks, so a *.su file was generated for each module, with the stack-size information embedded in it. Since there are a lot of such files, I picked one benchmark, 511.povray, to check; it is the one with the largest runtime overhead when the initialization is added (for both A and D). I identified all the *.su files that differ between A and D and diffed them. It looks like the stack size is much higher with D than with A, for example:

$ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
5c5
< bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int)  160  static
---
> bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int)  96  static

$ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
9c9
< image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)  624  static
---
> image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)  272  static

….

It looks like implementation D has a larger stack-size impact than A. Do you have any insight into the reason for this?
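For reference, here is a small, self-contained sketch (not taken from the benchmark sources) of how I picture the difference between the two approaches at the point where the initialization is introduced; the .DEFERRED_INIT call shown in the comment is only a placeholder for whatever arguments the internal function finally takes:

/* Illustrative only: a function with an uninitialized local aggregate,
   similar in shape to the povray routines in the *.su diffs above.  */
struct bounds { double lower[3], upper[3]; };

double
total_volume (const struct bounds *in, long n)
{
  struct bounds tmp;   /* auto variable without an explicit initializer */
  double vol = 0.0;
  long i;

  for (i = 0; i < n; i++)
    {
      tmp = in[i];
      vol += (tmp.upper[0] - tmp.lower[0])
             * (tmp.upper[1] - tmp.lower[1])
             * (tmp.upper[2] - tmp.lower[2]);
    }
  return vol;
}

/* With -ftrivial-auto-var-init=zero, roughly speaking, the gimplifier adds
   for `tmp':

     approach A:   tmp = {};                    (a real gimple assignment,
                                                 visible to every later pass)

     approach D:   tmp = .DEFERRED_INIT (...);  (an opaque internal call that
                                                 is only turned into a real
                                                 zero-initialization at RTL
                                                 expansion; the argument list
                                                 here is just a placeholder)

   The -fstack-usage numbers above are comparing how these two forms end up
   after optimization.  */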
Let me know if you have any comments and suggestions.

Thanks.

Qing


> On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de> wrote:
>
> On Tue, 12 Jan 2021, Qing Zhao wrote:
>
>> Hi,
>>
>> Just checking in to see whether you have any comments and suggestions on this:
>>
>> FYI, I have been continuing with the Approach D implementation since last week:
>>
>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the
>> .DEFERRED_INIT during expand to real initialization. Adjusting the
>> uninitialized pass to handle the new refs with ".DEFERRED_INIT".
>>
>> For the remaining work of Approach D:
>>
>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>> ** complete the implementation of the uninitialized warnings maintenance
>> work for D.
>>
>> I have completed the uninitialized warnings maintenance work for D,
>> and finished part of the -ftrivial-auto-var-init=pattern implementation.
>>
>> The following is the remaining work for Approach D:
>>
>> ** -ftrivial-auto-var-init=pattern for VLAs;
>> ** add a new attribute for variables:
>>    __attribute__((uninitialized))
>>    the marked variable is intentionally left uninitialized for performance reasons;
>> ** adding complete test cases.
>>
>> Please let me know if you have any objection to my current decision to
>> implement approach D.
>
> Did you do any analysis on how stack usage and code size are changed
> with approach D?  How does compile-time behave (we could gobble up
> lots of .DEFERRED_INIT calls I guess)?
>
> Richard.
>
>> Thanks a lot for your help.
>>
>> Qing
>>
>>> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> Hi,
>>>
>>> This is an update for our previous discussion.
>>>
>>> 1. I implemented the following two different implementations in the latest
>>> upstream gcc:
>>>
>>> A. Adding real initialization during gimplification, not maintaining the
>>> uninitialized warnings.
>>>
>>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the
>>> .DEFERRED_INIT during expand to real initialization. Adjusting the
>>> uninitialized pass to handle the new refs with ".DEFERRED_INIT".
>>>
>>> Note, in this initial implementation:
>>> ** I ONLY implement -ftrivial-auto-var-init=zero; the implementation of
>>> -ftrivial-auto-var-init=pattern is not done yet. Therefore, the
>>> performance data is only about -ftrivial-auto-var-init=zero.
>>> ** I added a temporary option -fauto-var-init-approach=A|B|C|D to
>>> choose implementation A or D for the runtime performance study.
>>> ** I didn't finish the uninitialized warnings maintenance work for D.
>>> (That might take more time than I expected.)
>>>
>>> 2. I collected runtime data for CPU2017 on an x86 machine with this new gcc
>>> for the following 3 cases:
>>>
>>> no: default (-g -O2 -march=native)
>>> A:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=A
>>> D:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=D
>>>
>>> and then computed the slowdown data for both A and D as follows:
>>>
>>> benchmarks          A/no      D/no
>>>
>>> 500.perlbench_r     1.25%     1.25%
>>> 502.gcc_r           0.68%     1.80%
>>> 505.mcf_r           0.68%     0.14%
>>> 520.omnetpp_r       4.83%     4.68%
>>> 523.xalancbmk_r     0.18%     1.96%
>>> 525.x264_r          1.55%     2.07%
>>> 531.deepsjeng_r    11.57%    11.85%
>>> 541.leela_r         0.64%     0.80%
>>> 557.xz_r           -0.41%    -0.41%
>>>
>>> 507.cactuBSSN_r     0.44%     0.44%
>>> 508.namd_r          0.34%     0.34%
>>> 510.parest_r        0.17%     0.25%
>>> 511.povray_r       56.57%    57.27%
>>> 519.lbm_r           0.00%     0.00%
>>> 521.wrf_r          -0.28%    -0.37%
>>> 526.blender_r      16.96%    17.71%
>>> 527.cam4_r          0.70%     0.53%
>>> 538.imagick_r       2.40%     2.40%
>>> 544.nab_r           0.00%    -0.65%
>>>
>>> avg                 5.17%     5.37%
>>>
>>> From the above data, we can see that, in general, the runtime slowdowns for
>>> implementations A and D are similar for individual benchmarks.
>>>
>>> Several benchmarks have a significant slowdown with the newly added
>>> initialization for both A and D, for example 511.povray_r, 526.blender_r,
>>> and 531.deepsjeng_r. I will try to study a little more which of the new
>>> initializations introduce such slowdown.
>>>
>>> From the current study so far, I think that approach D should be good
>>> enough for our final implementation.
>>> So, I will try to finish approach D with the following remaining work:
>>>
>>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>>> ** complete the implementation of the uninitialized warnings maintenance
>>> work for D.
>>>
>>> Let me know if you have any comments and suggestions on my current and
>>> future work.
>>>
>>> Thanks a lot for your help.
>>>
>>> Qing
>>>
>>>> On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches
>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> The following are the approaches I will implement and compare:
>>>>
>>>> Our final goal is to keep the uninitialized warnings and minimize the
>>>> run-time performance cost.
>>>>
>>>> A. Adding real initialization during gimplification, not maintaining the
>>>> uninitialized warnings.
>>>> B. Adding real initialization during gimplification, marking it with
>>>> "artificial_init". Adjusting the uninitialized pass, maintaining the
>>>> annotation, making sure the real init is not deleted from the fake init.
>>>> C. Marking the DECL for an uninitialized auto variable as
>>>> "no_explicit_init" during gimplification, maintaining this
>>>> "no_explicit_init" bit till after pass_late_warn_uninitialized, or till
>>>> pass_expand, then adding real initialization for all DECLs that are
>>>> marked with "no_explicit_init".
>>>> D. Adding .DEFERRED_INIT during gimplification, expanding the
>>>> .DEFERRED_INIT during expand to real initialization. Adjusting the
>>>> uninitialized pass to handle the new refs with ".DEFERRED_INIT".
>>>>
>>>> In the above, approach A will be the one that has the minimum run-time
>>>> cost and will be the base for the performance comparison.
>>>>
>>>> I will then implement approach D; this one is expected to have the most
>>>> run-time overhead among the above list, but its implementation should be
>>>> the cleanest among B, C, and D. Let's see how much more performance
>>>> overhead this approach will have. If the data is good, maybe we can avoid
>>>> the effort to implement B and C.
>>>>
>>>> If the performance of D is not good, I will implement B or C at that time.
>>>>
>>>> Let me know if you have any comments or suggestions.
>>>>
>>>> Thanks.
>>>>
>>>> Qing
>>>
>>
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
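P.S. The remaining-work list above mentions a new variable attribute, __attribute__((uninitialized)). As a purely hypothetical sketch, assuming the attribute keeps the name and form proposed in this thread (and with fill_buffer standing in for any routine that always overwrites the buffer), its intended use would look something like:

/* Hypothetical example of the proposed attribute: a buffer that is always
   fully overwritten before it is read can opt out of the automatic
   initialization added by -ftrivial-auto-var-init.  */
extern long fill_buffer (char *buf, long len);   /* stand-in routine */

long
checksum_block (long len)
{
  char buf[4096] __attribute__ ((uninitialized));  /* skip the added init */
  long got = fill_buffer (buf, len < 4096 ? len : 4096);
  long sum = 0;

  for (long i = 0; i < got; i++)
    sum += (unsigned char) buf[i];
  return sum;
}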