On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao <qing.z...@oracle.com> wrote:
>
>> On Jan 15, 2021, at 2:11 AM, Richard Biener <rguent...@suse.de> wrote:
>>
>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>>
>>> Hi,
>>>
>>> More data on code size and compilation time with CPU2017:
>>>
>>> ******** Compilation time data: the numbers are the slowdown against the
>>> default "no":
>>>
>>> benchmarks          A/no      D/no
>>>
>>> 500.perlbench_r     5.19%     1.95%
>>> 502.gcc_r           0.46%    -0.23%
>>> 505.mcf_r           0.00%     0.00%
>>> 520.omnetpp_r       0.85%     0.00%
>>> 523.xalancbmk_r     0.79%    -0.40%
>>> 525.x264_r         -4.48%     0.00%
>>> 531.deepsjeng_r    16.67%    16.67%
>>> 541.leela_r         0.00%     0.00%
>>> 557.xz_r            0.00%     0.00%
>>>
>>> 507.cactuBSSN_r     1.16%     0.58%
>>> 508.namd_r          9.62%     8.65%
>>> 510.parest_r        0.48%     1.19%
>>> 511.povray_r        3.70%     3.70%
>>> 519.lbm_r           0.00%     0.00%
>>> 521.wrf_r           0.05%     0.02%
>>> 526.blender_r       0.33%     1.32%
>>> 527.cam4_r         -0.93%    -0.93%
>>> 538.imagick_r       1.32%     3.95%
>>> 544.nab_r           0.00%     0.00%
>>>
>>> From the above data, it looks like the compilation-time impact of
>>> implementations A and D is almost the same.
>>>
>>> ******* Code size data: the numbers are the code size increase against the
>>> default "no":
>>>
>>> benchmarks          A/no      D/no
>>>
>>> 500.perlbench_r     2.84%     0.34%
>>> 502.gcc_r           2.59%     0.35%
>>> 505.mcf_r           3.55%     0.39%
>>> 520.omnetpp_r       0.54%     0.03%
>>> 523.xalancbmk_r     0.36%     0.39%
>>> 525.x264_r          1.39%     0.13%
>>> 531.deepsjeng_r     2.15%    -1.12%
>>> 541.leela_r         0.50%    -0.20%
>>> 557.xz_r            0.31%     0.13%
>>>
>>> 507.cactuBSSN_r     5.00%    -0.01%
>>> 508.namd_r          3.64%    -0.07%
>>> 510.parest_r        1.12%     0.33%
>>> 511.povray_r        4.18%     1.16%
>>> 519.lbm_r           8.83%     6.44%
>>> 521.wrf_r           0.08%     0.02%
>>> 526.blender_r       1.63%     0.45%
>>> 527.cam4_r          0.16%     0.06%
>>> 538.imagick_r       3.18%    -0.80%
>>> 544.nab_r           5.76%    -1.11%
>>> Avg                 2.52%     0.36%
>>>
>>> From the above data, implementation D is always better than A on code
>>> size, which is surprising to me; I am not sure what the reason is.
>>
>> D probably inhibits most interesting loop transforms (check SPEC FP
>> performance).
>
> The call to .DEFERRED_INIT is marked as ECF_CONST:
>
> /* A function to represent an artificial initialization to an uninitialized
>    automatic variable.  The first argument is the variable itself, the
>    second argument is the initialization type.  */
> DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>
> So I assume that such a const call should minimize the impact on loop
> optimizations.  But yes, it will still inhibit some of the loop
> transformations.
>
>> It will also most definitely disallow SRA which, when
>> an aggregate is not completely elided, tends to grow code.
>
> Makes sense to me.
>
> The run-time performance data for D and A are actually very similar, as I
> posted in the previous email (listed here again for convenience).
>
> Run-time performance overhead with A and D:
>
> benchmarks          A/no      D/no
>
> 500.perlbench_r     1.25%     1.25%
> 502.gcc_r           0.68%     1.80%
> 505.mcf_r           0.68%     0.14%
> 520.omnetpp_r       4.83%     4.68%
> 523.xalancbmk_r     0.18%     1.96%
> 525.x264_r          1.55%     2.07%
> 531.deepsjeng_r    11.57%    11.85%
> 541.leela_r         0.64%     0.80%
> 557.xz_r           -0.41%    -0.41%
>
> 507.cactuBSSN_r     0.44%     0.44%
> 508.namd_r          0.34%     0.34%
> 510.parest_r        0.17%     0.25%
> 511.povray_r       56.57%    57.27%
> 519.lbm_r           0.00%     0.00%
> 521.wrf_r          -0.28%    -0.37%
> 526.blender_r      16.96%    17.71%
> 527.cam4_r          0.70%     0.53%
> 538.imagick_r       2.40%     2.40%
> 544.nab_r           0.00%    -0.65%
>
> avg                 5.17%     5.37%
>
> Especially for the SPEC FP benchmarks, I didn't see much performance
> difference between A and D.
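> For illustration, here is a minimal sketch of how the two prototypes differ
> for a single local (the function and variable names below are made up; the
> GIMPLE forms are only indicated in the comments and are not the exact
> emitted IR):
>
>   /* An auto variable with no explicit initializer.  */
>   int foo (int flag)
>   {
>     int x;
>     if (flag)
>       x = 42;
>     return x;   /* may be used uninitialized */
>   }
>
>   /* Approach A: the gimplifier emits a real store, roughly "x = 0;", at the
>      point of the declaration, so later passes see an ordinary
>      initialization (and the uninitialized warning for "return x" is lost).
>      Approach D: the gimplifier instead emits "x = .DEFERRED_INIT (x, 2);",
>      where the second argument is the initialization type as described in
>      the DEF_INTERNAL_FN comment above; the uninitialized pass is adjusted
>      to look through it, and it is only expanded to the real zero store at
>      RTL expansion time.  */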
> I guess that the RTL optimizations might be enough to get rid of most of
> the overhead introduced by the additional initialization.
>
>>> ******** Stack usage data: I added -fstack-usage to the compilation line
>>> when compiling the CPU2017 benchmarks, so a *.su file was generated for
>>> each of the modules.
>>>
>>> Since there are a lot of such files and the stack size information is
>>> embedded in each of them, I just picked one benchmark, 511.povray, to
>>> check; it is the one with the most runtime overhead when adding the
>>> initialization (for both A and D).
>>>
>>> I identified all the *.su files that differ between A and D and diffed
>>> them; the stack size is much higher with D than with A, for example:
>>>
>>> $ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
>>> 5c5
>>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int)  160  static
>>> ---
>>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int)  96   static
>>>
>>> $ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
>>> 9c9
>>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)  624  static
>>> ---
>>> > image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)  272  static
>>> ....
>>>
>>> It looks like implementation D has a bigger stack size impact than A.
>>> Do you have any insight into the reason for this?
>>
>> D will keep all initialized aggregates as aggregates and live which
>> means stack will be allocated for it.  With A the usual optimizations
>> to reduce stack usage can be applied.
>
> I checked the routine pov::bump_map in 511.povray_r, since it has a large
> stack increase due to implementation D, by examining the IR immediately
> before the RTL expansion phase (image.cpp.244t.optimized).  I found the
> following additional statements for the array elements:
>
> void pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double * normal)
> {
> ...
>   double p3[3];
>   double p2[3];
>   double p1[3];
>   float colour3[5];
>   float colour2[5];
>   float colour1[5];
> ...
>   # DEBUG BEGIN_STMT
>   colour1 = .DEFERRED_INIT (colour1, 2);
>   colour2 = .DEFERRED_INIT (colour2, 2);
>   colour3 = .DEFERRED_INIT (colour3, 2);
>   # DEBUG BEGIN_STMT
>   MEM <double> [(double[3] *)&p1] = p1$0_144(D);
>   MEM <double> [(double[3] *)&p1 + 8B] = p1$1_135(D);
>   MEM <double> [(double[3] *)&p1 + 16B] = p1$2_138(D);
>   p1 = .DEFERRED_INIT (p1, 2);
>   # DEBUG D#12 => MEM <double> [(double[3] *)&p1]
>   # DEBUG p1$0 => D#12
>   # DEBUG D#11 => MEM <double> [(double[3] *)&p1 + 8B]
>   # DEBUG p1$1 => D#11
>   # DEBUG D#10 => MEM <double> [(double[3] *)&p1 + 16B]
>   # DEBUG p1$2 => D#10
>   MEM <double> [(double[3] *)&p2] = p2$0_109(D);
>   MEM <double> [(double[3] *)&p2 + 8B] = p2$1_111(D);
>   MEM <double> [(double[3] *)&p2 + 16B] = p2$2_254(D);
>   p2 = .DEFERRED_INIT (p2, 2);
>   # DEBUG D#9 => MEM <double> [(double[3] *)&p2]
>   # DEBUG p2$0 => D#9
>   # DEBUG D#8 => MEM <double> [(double[3] *)&p2 + 8B]
>   # DEBUG p2$1 => D#8
>   # DEBUG D#7 => MEM <double> [(double[3] *)&p2 + 16B]
>   # DEBUG p2$2 => D#7
>   MEM <double> [(double[3] *)&p3] = p3$0_256(D);
>   MEM <double> [(double[3] *)&p3 + 8B] = p3$1_258(D);
>   MEM <double> [(double[3] *)&p3 + 16B] = p3$2_260(D);
>   p3 = .DEFERRED_INIT (p3, 2);
>   ....
> }
>
> I guess that the above "MEM <double> ... = ..." stores are the ones that
> make the difference.
Which phase introduced them?  Looks like SRA.  But you can just dump all and
grep for the first occurrence.
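Something along these lines should show which dump first introduces those
stores (an illustrative command line only; -fauto-var-init-approach is the
temporary option from your prototype, and the dump-file names just follow the
usual numbered-pass convention):

  g++ -O2 -g -ftrivial-auto-var-init=zero -fauto-var-init-approach=D \
      -fdump-tree-all -c image.cpp
  # List the per-pass dumps that already contain one of the component
  # stores; the lowest-numbered one is where they first appear.
  grep -lF 'MEM <double> [(double[3] *)&p1]' image.cpp.*t.* | sort | head -n1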
>>
>>> Let me know if you have any comments and suggestions.
>>
>> First of all I would check whether the prototype implementations
>> work as expected.
>
> I have done such checks with small test cases already, checking the IR
> generated with implementation A or D, focusing mainly on *.c.006t.gimple
> and *.c.*t.expand; all worked as expected.
>
> For CPU2017, for example in the case above, I also checked the IR for both
> A and D, and it looks like everything worked as expected.
>
> Thanks.
>
> Qing
>
>>
>> Richard.
>>
>>> thanks.
>>>
>>> Qing
>>>
>>> On Jan 13, 2021, at 1:39 AM, Richard Biener <rguent...@suse.de> wrote:
>>>
>>> On Tue, 12 Jan 2021, Qing Zhao wrote:
>>>
>>> Hi,
>>>
>>> Just checking in to see whether you have any comments and suggestions
>>> on this.
>>>
>>> FYI, I have been continuing with the Approach D implementation since
>>> last week:
>>>
>>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the
>>> .DEFERRED_INIT during expand to real initialization, and adjusting the
>>> uninitialized pass to handle the new refs to .DEFERRED_INIT.
>>>
>>> For the remaining work of Approach D:
>>>
>>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>>> ** complete the uninitialized warnings maintenance work for D.
>>>
>>> I have completed the uninitialized warnings maintenance work for D,
>>> and partially finished the -ftrivial-auto-var-init=pattern
>>> implementation.
>>>
>>> The following is the remaining work for Approach D:
>>>
>>> ** -ftrivial-auto-var-init=pattern for VLAs;
>>> ** add a new variable attribute, __attribute__((uninitialized)): the
>>>    marked variable is intentionally left uninitialized for performance
>>>    reasons;
>>> ** add complete test cases;
>>>
>>> Please let me know if you have any objection to my current decision to
>>> implement approach D.
>>>
>>> Did you do any analysis on how stack usage and code size are changed
>>> with approach D?  How does compile-time behave (we could gobble up
>>> lots of .DEFERRED_INIT calls I guess)?
>>>
>>> Richard.
>>>
>>> Thanks a lot for your help.
>>>
>>> Qing
>>>
>>> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> Hi,
>>>
>>> This is an update on our previous discussion.
>>>
>>> 1. I implemented the following two different implementations in the
>>> latest upstream gcc:
>>>
>>> A. Adding real initialization during gimplification, not maintaining
>>> the uninitialized warnings.
>>>
>>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding the
>>> .DEFERRED_INIT during expand to real initialization, and adjusting the
>>> uninitialized pass to handle the new refs to .DEFERRED_INIT.
>>>
>>> Note, in this initial implementation:
>>>
>>> ** I ONLY implemented -ftrivial-auto-var-init=zero; the implementation
>>> of -ftrivial-auto-var-init=pattern is not done yet.  Therefore, the
>>> performance data below is only about -ftrivial-auto-var-init=zero.
>>>
>>> ** I added a temporary option -fauto-var-init-approach=A|B|C|D to
>>> choose implementation A or D for the runtime performance study.
>>>
>>> ** I didn't finish the uninitialized warnings maintenance work for D.
>>> (That might take more time than I expected.)
>>>
>>> 2. I collected runtime data for CPU2017 on an x86 machine with this new
>>> gcc for the following 3 cases:
>>>
>>> no: default (-g -O2 -march=native)
>>> A:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=A
>>> D:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=D
>>>
>>> I then computed the slowdown data for both A and D as follows:
>>>
>>> benchmarks          A/no      D/no
>>>
>>> 500.perlbench_r     1.25%     1.25%
>>> 502.gcc_r           0.68%     1.80%
>>> 505.mcf_r           0.68%     0.14%
>>> 520.omnetpp_r       4.83%     4.68%
>>> 523.xalancbmk_r     0.18%     1.96%
>>> 525.x264_r          1.55%     2.07%
>>> 531.deepsjeng_r    11.57%    11.85%
>>> 541.leela_r         0.64%     0.80%
>>> 557.xz_r           -0.41%    -0.41%
>>>
>>> 507.cactuBSSN_r     0.44%     0.44%
>>> 508.namd_r          0.34%     0.34%
>>> 510.parest_r        0.17%     0.25%
>>> 511.povray_r       56.57%    57.27%
>>> 519.lbm_r           0.00%     0.00%
>>> 521.wrf_r          -0.28%    -0.37%
>>> 526.blender_r      16.96%    17.71%
>>> 527.cam4_r          0.70%     0.53%
>>> 538.imagick_r       2.40%     2.40%
>>> 544.nab_r           0.00%    -0.65%
>>>
>>> avg                 5.17%     5.37%
>>>
>>> From the above data, we can see that in general the runtime slowdowns of
>>> implementations A and D are similar for the individual benchmarks.
>>>
>>> Several benchmarks show a significant slowdown with the newly added
>>> initialization for both A and D, for example 511.povray_r, 526.blender_r,
>>> and 531.deepsjeng_r.  I will try to study a bit more which of the new
>>> initializations introduced this slowdown.
>>>
>>> From the study so far, I think approach D should be good enough for our
>>> final implementation, so I will try to finish approach D with the
>>> following remaining work:
>>>
>>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>>> ** complete the uninitialized warnings maintenance work for D.
>>>
>>> Let me know if you have any comments and suggestions on my current and
>>> future work.
>>>
>>> Thanks a lot for your help.
>>>
>>> Qing
>>>
>>> On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> The following are the approaches I will implement and compare:
>>>
>>> Our final goal is to keep the uninitialized warning and minimize the
>>> run-time performance cost.
>>>
>>> A. Adding real initialization during gimplification, not maintaining the
>>> uninitialized warnings.
>>>
>>> B. Adding real initialization during gimplification, marking it with
>>> "artificial_init".  Adjusting the uninitialized pass, maintaining the
>>> annotation, and making sure the real init is not deleted.
>>>
>>> C. Marking the DECL of an uninitialized auto variable as
>>> "no_explicit_init" during gimplification, maintaining this
>>> "no_explicit_init" bit until after pass_late_warn_uninitialized, or until
>>> pass_expand, then adding real initialization for all DECLs that are
>>> marked with "no_explicit_init".
>>>
>>> D. Adding .DEFERRED_INIT during gimplification, expanding the
>>> .DEFERRED_INIT during expand to real initialization, and adjusting the
>>> uninitialized pass to handle the new refs to .DEFERRED_INIT.
>>>
>>> In the above, approach A will be the one with the minimum run-time cost
>>> and will be the base for the performance comparison.
>>>
>>> So I will implement approach D; it is expected to have the most run-time
>>> overhead among the above list, but its implementation should be the
>>> cleanest among B, C, and D.  Let's see how much more performance overhead
>>> this approach will have.
>>> If the data is good, maybe we can avoid the effort of implementing B
>>> and C.  If the performance of D is not good, I will implement B or C at
>>> that time.
>>>
>>> Let me know if you have any comments or suggestions.
>>>
>>> Thanks.
>>>
>>> Qing
>>>
>>> --
>>> Richard Biener <rguent...@suse.de>
>>> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
>>> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)