On Tue, 24 Jul 2018, Fredrik Hederstierna wrote:
> So my question is how to approach this problems when doing benchmarking,
> ofcourse we want the benchmark to mirror as near as 'real life' code as
> possible. But if code contains real bugs, and issues that could cause
> unpredictable code generation, should such code be banned from benchmarking,
> since results might be misleading?
Well, all benchmarks are going to be imperfect reflections of real-life
workloads in the first place, so their bugs just increase the degree to which
they are misleading. When a new compiler version starts to treat some undefined
piece of code differently, it can cause a range of effects, from code size
perturbations as in your case to completely invalidating the benchmark, as
happened with SPEC CPU2006 464.h264ref (where GCC exploited undefined behavior
in a loop, turning it into an infinite loop that eventually segfaulted).

Perhaps even though results on individual benchmarks can vary wildly,
aggregated results across a wide range of non-toy benchmarks may be indicative
of ... something, because they are unlikely to all exhibit the same "bugs".

> On the other hand, the compiler should
> generate best code for any input?

Engineering effort is limited, so it's probably better to go for generating
good code for inputs that are likely to resemble actively used code (and in
actively used and maintained code, bugs can be reported and fixed) :)

> What do you think, should benchmarking code not being allowed to have eg
> warnings like -Wuninitialized and maybe -Wmaybe-uninitialized? Are there more
> warnings that indicate unpredictable code generations due to bad code, or are
> the root cause that these are 'bugs', and we should not allow real bugs at all
> in benchmarking code?

A blanket ban on warnings won't work: the warnings have false positives
(especially the -Wmaybe- one), and there exists code that validly uses
uninitialized data. I don't have such a striking example for scalar variables,
but for uninitialized arrays there's this sparse set algorithm (which GCC
itself also uses):

    https://research.swtch.com/sparse

I think good benchmark sets should be able to evolve to account for newly
discovered bugs rather than remain frozen (which sounds like a recipe for
becoming obsolete sooner rather than later).

Alexander
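For readers who don't want to follow the link, here is a rough sketch in C of
the sparse set idea described there. The names (sparse_set, ss_init, ss_add,
and so on) are made up for illustration and are not taken from GCC or from the
article. The point is that sparse[] is never cleared: membership is only
trusted when dense[] confirms it, so whatever garbage sparse[] holds cannot
produce a wrong answer. Note that this deliberately reads memory that was
never written, so tools such as Valgrind may report it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    unsigned *dense;   /* members, packed in insertion order */
    unsigned *sparse;  /* sparse[v] = claimed position of v in dense[] */
    unsigned n;        /* number of members currently in the set */
    unsigned cap;      /* universe size: members lie in [0, cap) */
} sparse_set;

/* Allocate storage for values in [0, cap). Uses malloc, not calloc:
   neither array is initialized, which is the whole point. */
static bool ss_init(sparse_set *s, unsigned cap)
{
    s->dense = malloc(cap * sizeof *s->dense);
    s->sparse = malloc(cap * sizeof *s->sparse);
    s->n = 0;
    s->cap = cap;
    return s->dense != NULL && s->sparse != NULL;
}

/* v is a member only if sparse[v] points into the used part of dense[]
   and that slot points back at v. A garbage sparse[v] fails one of the
   two checks. */
static bool ss_contains(const sparse_set *s, unsigned v)
{
    assert(v < s->cap);
    unsigned i = s->sparse[v];  /* may be uninitialized garbage */
    return i < s->n && s->dense[i] == v;
}

/* Insert v if it is not already present. */
static void ss_add(sparse_set *s, unsigned v)
{
    if (ss_contains(s, v))
        return;
    s->dense[s->n] = v;
    s->sparse[v] = s->n;
    s->n++;
}

/* Empty the set in constant time: nothing is overwritten. */
static void ss_clear(sparse_set *s)
{
    s->n = 0;
}

static void ss_free(sparse_set *s)
{
    free(s->dense);
    free(s->sparse);
}
```

After ss_init(&s, 16), ss_add(&s, 3) and ss_add(&s, 7), a lookup of 5 reads
whatever happens to be in sparse[5] and still returns false, because either
the index is out of range or dense[] at that index holds 3 or 7. A blanket
"no reads of uninitialized data" rule would reject this code even though its
results do not depend on the garbage.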