[Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment)

trippels at gcc dot gnu.org Thu, 03 Sep 2015 00:19:39 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435


--- Comment #7 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #6)
> The issue seems in fact related to _instruction alignment_.
> More precisely, to alignment of some critical loop.
> 
> That's basically why adding some code to the file just "pushes" some
> other code into another position, potentially into a less favorable path
> (hence the appearance of "random impact").
> 
> 
> The following GCC command saved the day :
> -falign-loops=32
> 
> Note that -falign-loops=16 doesn't work.
> I'm suspecting it might be the default value, but can't be sure.
> I'm also suspecting that -falign-loops=32 is primarily useful for Broadwell
> cpu.

Here are the default values (from gcc/config/i386/i386.c):

/* Processor target table, indexed by processor number */
struct ptt
{
  const char *const name;                       /* processor name  */
  const struct processor_costs *cost;           /* Processor costs */
  const int align_loop;                         /* Default alignments.  */
  const int align_loop_max_skip;
  const int align_jump;
  const int align_jump_max_skip;
  const int align_func;
};

/* This table must be in sync with enum processor_type in i386.h.  */
static const struct ptt processor_target_table[PROCESSOR_max] =
{
  {"generic", &generic_cost, 16, 10, 16, 10, 16},
  {"i386", &i386_cost, 4, 3, 4, 3, 4},
  {"i486", &i486_cost, 16, 15, 16, 15, 16},
  {"pentium", &pentium_cost, 16, 7, 16, 7, 16},
  {"iamcu", &iamcu_cost, 16, 7, 16, 7, 16},
  {"pentiumpro", &pentiumpro_cost, 16, 15, 16, 10, 16},
  {"pentium4", &pentium4_cost, 0, 0, 0, 0, 0},
  {"nocona", &nocona_cost, 0, 0, 0, 0, 0},
  {"core2", &core_cost, 16, 10, 16, 10, 16},
  {"nehalem", &core_cost, 16, 10, 16, 10, 16},
  {"sandybridge", &core_cost, 16, 10, 16, 10, 16},
  {"haswell", &core_cost, 16, 10, 16, 10, 16},
  {"bonnell", &atom_cost, 16, 15, 16, 7, 16},
  {"silvermont", &slm_cost, 16, 15, 16, 7, 16},
  {"knl", &slm_cost, 16, 15, 16, 7, 16},
  {"intel", &intel_cost, 16, 15, 16, 7, 16},
  {"geode", &geode_cost, 0, 0, 0, 0, 0},
  {"k6", &k6_cost, 32, 7, 32, 7, 32},
  {"athlon", &athlon_cost, 16, 7, 16, 7, 16},
  {"k8", &k8_cost, 16, 7, 16, 7, 16},
  {"amdfam10", &amdfam10_cost, 32, 24, 32, 7, 32},
  {"bdver1", &bdver1_cost, 16, 10, 16, 7, 11},
  {"bdver2", &bdver2_cost, 16, 10, 16, 7, 11},
  {"bdver3", &bdver3_cost, 16, 10, 16, 7, 11},
  {"bdver4", &bdver4_cost, 16, 10, 16, 7, 11},
  {"btver1", &btver1_cost, 16, 10, 16, 7, 11},
  {"btver2", &btver2_cost, 16, 10, 16, 7, 11}
};

As you can see, only AMD's k6 and amdfam10 default to align_loop=32.
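
To make the align_loop and align_loop_max_skip columns more concrete,
here is a minimal sketch of the arithmetic they imply. It assumes the
usual ".p2align"-style reading, where a loop head is padded up to the
next multiple of align_loop only if that takes at most
align_loop_max_skip bytes. The helpers padding_needed() and
would_align() are hypothetical names for illustration and are not part
of GCC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of padding needed to move addr up to the next multiple of
   align.  align must be a non-zero power of two.  */
static unsigned padding_needed(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}

/* Assumed model: the loop head is aligned only when the required
   padding does not exceed max_skip; otherwise it stays where it is.  */
static bool would_align(uintptr_t addr, unsigned align, unsigned max_skip)
{
    return padding_needed(addr, align) <= max_skip;
}
```

With the "generic" values (16, 10), a loop head at offset 0x1005 would
need 11 bytes of padding and so, under this model, would be left
unaligned. A loop head at 0x1010 is already 16-byte aligned but still
sits 16 bytes short of the next 32-byte boundary, which fits the
observation that -falign-loops=16 and -falign-loops=32 can behave
differently.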

> Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
> It seems not possible to apply this optimization from within the source file,
> such as using :
> #pragma GCC optimize ("align-loops=32")
> or the function targeted :
> __attribute__((optimize("align-loops=32")))
> 
> None of these alternatives works.

I don't think this makes much sense for a binary that should run on
any x86 processor anyway. Optimizing for just one specific model will
negatively affect performance on another.
If you want maximum performance, you need to offer different binaries for
different CPUs.
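
A lighter-weight variation on the same idea is to ship several variants
of the hot function in one binary and choose one at startup. The sketch
below is only an illustration under stated assumptions: sum_generic(),
sum_tuned(), cpu_prefers_tuned() and select_sum() are hypothetical
names, the two variants are identical stand-ins (in a real build each
would live in its own file compiled with different -mtune or
-falign-loops flags), and using "Intel with AVX2" as a proxy for
"Haswell or newer" is an assumption, not something established in this
report. The __builtin_cpu_* calls are GCC built-ins for x86; on other
compilers or targets the sketch falls back to the generic variant.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *, size_t);

/* Stand-in for the variant built with default alignment flags.  */
static long sum_generic(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for the variant that would be built in a separate file
   with, e.g., -falign-loops=32.  The logic is deliberately identical.  */
static long sum_tuned(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Returns non-zero when the running CPU matches the assumed target.  */
static int cpu_prefers_tuned(void)
{
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_is("intel") && __builtin_cpu_supports("avx2");
#else
    return 0;   /* unknown compiler or target: use the generic variant */
#endif
}

/* Pick the variant once; callers keep the returned pointer.  */
static sum_fn select_sum(void)
{
    return cpu_prefers_tuned() ? sum_tuned : sum_generic;
}
```

A caller would do "sum_fn sum = select_sum();" once and then call
sum(data, n) in the hot path. Whether the tuned variant is actually
faster on a given model still has to be measured.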

See also (for a similar issue):
http://pzemtsov.github.io/2014/05/12/mystery-of-unstable-performance.html
