On Tue, 11 Jun 2019, Kewen.Lin wrote:

> >> If my understanding on this question is correct, IMHO we should try to make
> >> IVOPTs conservative than optimistic, since once the predict is wrong from
> >> too optimistic decision, the costing on the doloop use is wrong, it's very
> >> possible to affect the global optimal set.  It looks we don't have any ways
> >> to recover it in RTL then?  (otherwise, there should be better place to fix
> >> the PR).  Although it's also possible to miss some good cases, it's at 
> >> least
> >> as good as before, I'm inclined to make it conservative.
> > 
> > I wonder if you could simply benchmark what happens if you make
> > IVOPTs _always_ create a doloop IV (if possible)?  I doubt the
> > cases where a doloop IV is bad (calls, etc.) are too common and
> > that in those cases the extra simple IV hurts.
> > 
> 
> Hi Richard and all,
> 
> With these different settings:
>       Base) without any changes
>       A) having predict_doloop and enable all checks
>       B) A + disable check on "too few iterations" (0 <= est_niter < 3)
>       C) A + disable costly niter check
>       D) A + disable invalid stmt check (call/computed_goto/switch)
> 
> I collected some runtime performance data with SPEC2017 as following:
> 
>                       Avs.Base   Bvs.A  Cvs.A  Dvs.A
> 500.perlbench_r               0.00%   0.38%   0.00%   -0.19%
> 502.gcc_r             0.00%   0.38%   0.00%   0.00%
> 505.mcf_r             0.89%   0.00%   0.00%   0.00%
> 520.omnetpp_r         -0.41%  -1.25%  0.00%   0.00%
> 523.xalancbmk_r               -0.36%  -0.36%  -0.36%  -0.73%
> 525.x264_r            1.14%   0.00%   0.00%   0.00%
> 531.deepsjeng_r               -0.26%  0.26%   0.00%   0.00%
> 541.leela_r           0.00%   0.37%   0.00%   0.00%
> 548.exchange2_r               0.85%   -0.21%  0.00%   0.00%
> 557.xz_r              -0.77%  0.52%   -0.26%  0.00%
> 503.bwaves_r          0.00%   0.00%   0.36%   0.00%
> 507.cactuBSSN_r               -0.57%  0.00%   0.00%   0.00%
> 508.namd_r            -0.69%  0.35%   0.00%   0.35%
> 510.parest_r          0.17%   -0.17%  0.00%   -0.17%
> 511.povray_r          -1.31%  -0.44%  0.15%   0.15%
> 519.lbm_r             0.00%   0.00%   0.00%   0.00%
> 521.wrf_r             0.33%   -0.44%  -0.33%  -0.33%
> 526.blender_r         0.26%   0.26%   0.00%   0.00%
> 527.cam4_r            0.59%   -0.59%  0.00%   -0.39%
> 538.imagick_r         0.45%   0.00%   0.00%   0.00%
> 544.nab_r             0.23%   0.00%   0.00%   0.00%
> 549.fotonik3d_r               1.80%   -0.29%  0.00%   0.00%
> 554.roms_r            0.00%   0.00%   0.00%   0.00%
> geomean                       0.10%   -0.05%  -0.02%  -0.06%
> 
> As above, the difference is very small, looks like caused by noise and can 
> be ignored. 
> 
> I also ran partial of test suite with some explicit statistics dumping 
> (on gcc/g++/gfortran etc.). No regressions found. The unique files number 
> with predicted doloop found are:
> 
>   A) 3297
>   B) 3416
>   C) 3297
>   D) 3858
> 
> Some observations:
>   * Based on A) and C), we can see the checking on costly niter is useless, 
>     I plan to give the check up or replace it with one existing interface 
>     expression_expensive_p as Richard mentioned.  (Correct me if you have 
>     any concerns.)
>   * B) does filter some cases, I checked a few different cases, they are 
>     written with small iteration count indeed.
>   * The delta number isn't small between A) and D).
> 
> I ran some filtering by compiling C/C++ files at -O2 with A) and D) (-O2 is 
> probably not the actual option used in each testing, may not cause the 
> difference, but just for simplicity), obtained dumping assembly file and did
> further comparison, then I got 60 files finally.
> 
> I looked into all 60 cases (I assumed these are typical enough, don't need
> to go through the Fortran etc.).  Most of wi/wo differences are expected, 
> 55 of them use more insns to update biv, but 5 of them are trivial since they
> have to use original biv later so it's kept wi/wo the scanning.  Some of 55 
> takes one more register in prologue/epilogue, some of them have fewer setup 
> insns (which are to calculate the original value and bound for selected iv).
> 
> Some typical case looks like:
> 
>  Replacing exit test: if (ivtmp_2 != 0)                                       
>  \  Replacing exit test: if (ivtmp_2 != 0)
>   xxx ()                                                                      
>  \  xxx ()
>   {                                                                           
>  \  {
>     unsigned long ivtmp.9;                                                    
>  \    unsigned long ivtmp.9;
>     int iter;                                                                 
>  \    int iter;
>     int _1;                                                                   
>  \    int _1;
>   
> -----------------------------------------------------------------------------\
>     unsigned int ivtmp_2;
>   
> -----------------------------------------------------------------------------\
>     unsigned int ivtmp_3;
>     struct bla[100] * _13;                                                    
>  \    struct bla[100] * _13;
>     void * _14;                                                               
>  \    void * _14;
>     unsigned long _15;                                                        
>  \  
> -----------------------------------------------------------------------------
>     unsigned long _16;                                                        
>  \  
> -----------------------------------------------------------------------------
>                                                                               
>  \
>     <bb 2> [local count: 10737418]:                                           
>  \    <bb 2> [local count: 10737418]:
>     _13 = &arr_base + 188;                                                    
>  \    _13 = &arr_base + 188;
>     ivtmp.9_8 = (unsigned long) _13;                                          
>  \    ivtmp.9_8 = (unsigned long) _13;
>     _15 = (unsigned long) &arr_base;                                          
>  \  
> -----------------------------------------------------------------------------
>     _16 = _15 + 44988;                                                        
>  \  
> -----------------------------------------------------------------------------
>                                                                               
>  \
>     <bb 3> [local count: 1063004407]:                                         
>  \    <bb 3> [local count: 1063004407]:
>   
> -----------------------------------------------------------------------------\
>     # ivtmp_3 = PHI <100(2), ivtmp_2(5)>
>     # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>                           
>  \    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>
>     _1 = foo ();                                                              
>  \    _1 = foo ();
>     _14 = (void *) ivtmp.9_10;                                                
>  \    _14 = (void *) ivtmp.9_10;
>     MEM[base: _14, offset: 0B] = _1;                                          
>  \    MEM[base: _14, offset: 0B] = _1;
>   
> -----------------------------------------------------------------------------\
>     ivtmp_2 = ivtmp_3 - 1;
>     ivtmp.9_9 = ivtmp.9_10 + 448;                                             
>  \    ivtmp.9_9 = ivtmp.9_10 + 448;
>     if (ivtmp.9_9 != _16)                                                     
>  \    if (ivtmp_2 != 0)
>       goto <bb 5>; [98.99%]                                                   
>  \      goto <bb 5>; [98.99%]
>     else                                                                      
>  \    else
>       goto <bb 4>; [1.01%]                                                    
>  \      goto <bb 4>; [1.01%]
> 
> 
>   >-------addis 31,2,.LC0@toc@ha                                              
>  \  >-------li 31,100
>   >-------ld 31,.LC0@toc@l(31)                                                
>  \  >-------addis 30,2,.LC0@toc@ha
>   >-------addis 30,31,0x1                                                     
>  \  >-------ld 30,.LC0@toc@l(30)
>   >-------std 0,16(1)                                                         
>  \  >-------std 0,16(1)
>   >-------stdu 1,-48(1)                                                       
>  \  >-------stdu 1,-48(1)
>   >-------.cfi_def_cfa_offset 48                                              
>  \  >-------.cfi_def_cfa_offset 48
>   >-------.cfi_offset 65, 16                                                  
>  \  
> -----------------------------------------------------------------------------
>   >-------addi 30,30,-20736                                                   
>  \  >-------.cfi_offset 65, 16
>   >-------.p2align 5                                                          
>  \  >-------.p2align 5
>   .L2:                                                                        
>  \  .L2:
>   >-------bl foo                                                              
>  \  >-------bl foo
>   >-------nop                                                                 
>  \  >-------nop
>   >-------addi 31,31,448                                                      
>  \  >-------addi 9,31,-1
>   >-------stw 3,-448(31)                                                      
>  \  >-------addi 30,30,448
>   >-------cmpld 0,31,30                                                       
>  \  >-------rldicl. 31,9,0,32
>   
> -----------------------------------------------------------------------------\
>   >-------stw 3,-448(30)
>   >-------bne 0,.L2                                                           
>  \  >-------bne 0,.L2
>   >-------addi 1,1,48                                                         
>  \  >-------addi 1,1,48
>   >-------.cfi_def_cfa_offset 0                                               
>  \  >-------.cfi_def_cfa_offset 0
>   >-------ld 0,16(1)                                                          
>  \  >-------ld 0,16(1)
>   >-------ld 30,-16(1)                                                        
>  \  >-------ld 30,-16(1)
>   >-------ld 31,-8(1)                                                         
>  \  >-------ld 31,-8(1)
> 
> 
> As shown above, P8 SPEC2017 performance evaluation shows to disable 
> invalid_stmt scanning doesn't have any impacts.  It looks it's fine
> not to force it.  The test suite small cases show it's possible to
> have sub optimal instruction sequence with eliminated BIV and its update. 
> 
> I guess the concern on the scanning is compilation time cost, is it
> worth to doing according to current data?
> 
> What do you think of this? 

According to the data I'd say keep the code as simple as possible
at start and add tuning or extra checks only later together with
actual testcases where it matters.

Richard.

Reply via email to