https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435
--- Comment #7 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #6)
> The issue seems in fact related to _instruction alignment_.
> More precisely, to alignment of some critical loop.
>
> That's basically why adding some code in the file would just "pushes" some
> other code into another position, potentially into a less favorable path
> (hence the appearance of "random impact").
>
>
> The following GCC command saved the day :
> -falign-loops=32
>
> Note that -falign-loops=16 doesn't work.
> I'm suspecting it might be the default value, but can't be sure.
> I'm also suspecting that -falign-loops=32 is primarily useful for Broadwell
> cpu.

Here are the default values (from gcc/config/i386/i386.c):

  /* Processor target table, indexed by processor number */
  struct ptt
  {
    const char *const name;               /* processor name */
    const struct processor_costs *cost;   /* Processor costs */
    const int align_loop;                 /* Default alignments. */
    const int align_loop_max_skip;
    const int align_jump;
    const int align_jump_max_skip;
    const int align_func;
  };

  /* This table must be in sync with enum processor_type in i386.h. */
  static const struct ptt processor_target_table[PROCESSOR_max] =
  {
    {"generic",     &generic_cost,    16, 10, 16, 10, 16},
    {"i386",        &i386_cost,        4,  3,  4,  3,  4},
    {"i486",        &i486_cost,       16, 15, 16, 15, 16},
    {"pentium",     &pentium_cost,    16,  7, 16,  7, 16},
    {"iamcu",       &iamcu_cost,      16,  7, 16,  7, 16},
    {"pentiumpro",  &pentiumpro_cost, 16, 15, 16, 10, 16},
    {"pentium4",    &pentium4_cost,    0,  0,  0,  0,  0},
    {"nocona",      &nocona_cost,      0,  0,  0,  0,  0},
    {"core2",       &core_cost,       16, 10, 16, 10, 16},
    {"nehalem",     &core_cost,       16, 10, 16, 10, 16},
    {"sandybridge", &core_cost,       16, 10, 16, 10, 16},
    {"haswell",     &core_cost,       16, 10, 16, 10, 16},
    {"bonnell",     &atom_cost,       16, 15, 16,  7, 16},
    {"silvermont",  &slm_cost,        16, 15, 16,  7, 16},
    {"knl",         &slm_cost,        16, 15, 16,  7, 16},
    {"intel",       &intel_cost,      16, 15, 16,  7, 16},
    {"geode",       &geode_cost,       0,  0,  0,  0,  0},
    {"k6",          &k6_cost,         32,  7, 32,  7, 32},
    {"athlon",      &athlon_cost,     16,  7, 16,  7, 16},
    {"k8",          &k8_cost,         16,  7, 16,  7, 16},
    {"amdfam10",    &amdfam10_cost,   32, 24, 32,  7, 32},
    {"bdver1",      &bdver1_cost,     16, 10, 16,  7, 11},
    {"bdver2",      &bdver2_cost,     16, 10, 16,  7, 11},
    {"bdver3",      &bdver3_cost,     16, 10, 16,  7, 11},
    {"bdver4",      &bdver4_cost,     16, 10, 16,  7, 11},
    {"btver1",      &btver1_cost,     16, 10, 16,  7, 11},
    {"btver2",      &btver2_cost,     16, 10, 16,  7, 11}
  };

As you can see only AMD's k6 and amdfam10 default to align_loop=32.

> Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
> It seems not possible to apply this optimization from within the source file,
> such as using :
> #pragma GCC optimize ("align-loops=32")
> or the function targeted :
> __attribute__((optimize("align-loops=32")))
>
> None of these alternatives does work.

I don't think this makes much sense for a binary that should run on any X86
processor anyway. Optimizing for just one specific model will negatively affect
performance on another.
If you want maximal performance you need to offer different binaries for
different CPUs.

See also (for a similar issue):
http://pzemtsov.github.io/2014/05/12/mystery-of-unstable-performance.html
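
To illustrate how an align_loop / align_loop_max_skip pair from the table
can leave a loop unaligned, here is a minimal sketch in C. It assumes the
pair ends up as a ".p2align"-style assembler directive, where padding is
inserted only if no more than max_skip bytes are needed. The function names
padding_needed and loop_start and the example addresses are hypothetical;
this is a model of the rule, not code from GCC.

```c
#include <stdint.h>

/* Bytes of padding needed to move `addr` up to the next multiple of
   `align`, which must be a power of two.  Returns 0 if `addr` is
   already aligned.  (Hypothetical helper, not from GCC.)  */
static unsigned padding_needed(uintptr_t addr, unsigned align)
{
    unsigned misalign = (unsigned)(addr & (uintptr_t)(align - 1));
    return (align - misalign) & (align - 1);
}

/* Address at which a loop placed at `addr` would start under the
   assumed rule: pad up to `align`, but only if that takes at most
   `max_skip` bytes; otherwise leave the loop where it is.
   An `align` of 0 or 1 is treated here as "no alignment requested",
   which is an assumption of this sketch.  */
static uintptr_t loop_start(uintptr_t addr, unsigned align, unsigned max_skip)
{
    if (align <= 1)
        return addr;

    unsigned pad = padding_needed(addr, align);
    if (pad > max_skip)
        return addr;        /* too much padding: alignment is skipped */
    return addr + pad;
}
```

With the "generic" defaults (16, 10), a loop at the example address 0x1005
would need 11 bytes of padding and so stays where it is, while one at 0x1006
moves to 0x1010. A loop at 0x1010 is already 16-byte aligned but not 32-byte
aligned, so (assuming a maximum skip of 31) an alignment of 32 would move it
to 0x1020. That is one way -falign-loops=32 can give a different layout than
-falign-loops=16, and why unrelated code changes can shift a loop in or out
of a favorable position.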