On Tue, 24 Jul 2018, Fredrik Hederstierna wrote:

> So my question is how to approach this problem when doing benchmarking,
> of course we want the benchmark to mirror 'real life' code as closely as
> possible.  But if code contains real bugs, and issues that could cause
> unpredictable code generation, should such code be banned from benchmarking,
> since results might be misleading?

Well, all benchmarks are going to be imperfect reflections of real-life
workloads in the first place, so their bugs just increase the degree to
which they are misleading.

When a new compiler version starts to treat some undefined piece of code
differently, it can cause a range of effects, from code size perturbations
as in your case, to completely invalidating the benchmark as in the case of
SPEC CPU2006's 464.h264ref (where GCC exploited undefined behavior in a loop,
turning it into an infinite loop that eventually segfaulted).
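
To illustrate the kind of loop involved, here is a minimal sketch. The
problematic shape in the comment is modeled on the example given in the GCC
4.8 release notes for the SPEC CPU2006 H.264 benchmark; the function name
sum_abs and the constant N are made up for this illustration and are not
taken from the benchmark source.

```c
#include <assert.h>
#include <stdlib.h>

/* Problematic shape (kept in a comment on purpose, since it is undefined):
 *
 *     for (dd = d[k = 0]; k < 16; dd = d[++k])
 *         sum += abs(dd);
 *
 * After the last useful iteration, the increment expression evaluates
 * d[16], one past the end of a 16-element array, before k < 16 is tested.
 * A compiler may assume that out-of-bounds read never happens, conclude
 * that k < 16 always holds at the test, and drop the loop exit.
 */

enum { N = 16 };  /* hypothetical array length for this sketch */

/* Well-defined rewrite: each element is read only after the bound check. */
int sum_abs(const int d[N])
{
    assert(d != NULL);
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += abs(d[k]);
    return sum;
}
```

Both forms compute the same sum on a compiler that does not exploit the
out-of-bounds read, which is why such a bug can sit unnoticed in a benchmark
for years.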

Perhaps even though results on individual benchmarks can vary wildly,
aggregated results across a wide range of non-toy benchmarks may be
indicative of ... something, because they are unlikely to all exhibit
the same "bugs".

> On the other hand, the compiler should
> generate best code for any input?

Engineering effort is limited, so it's probably better to go for generating
good code for inputs that are likely to resemble actively used code (and in
actively used and maintained code, bugs can be reported and fixed) :)
 
> What do you think, should benchmarking code not be allowed to have e.g.
> warnings like -Wuninitialized and maybe -Wmaybe-uninitialized?  Are there more
> warnings that indicate unpredictable code generation due to bad code, or is
> the root cause that these are 'bugs', and we should not allow real bugs at all
> in benchmarking code?

A blanket ban on warnings won't work: they have false positives (especially the
-Wmaybe- one), and there exists code that validly uses uninitialized data. I
don't have such a striking example for scalar variables, but for uninitialized
arrays there's this sparse set algorithm (which GCC itself also uses):
https://research.swtch.com/sparse
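
A minimal sketch of that sparse set idea follows, assuming a fixed universe
of values [0, cap). The type and function names (sparse_set, sparse_set_add,
and so on) are invented for this illustration and do not correspond to GCC's
own implementation. The point is that sparse[] is read before it is ever
written, yet the answer never depends on the garbage it holds.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned *dense;   /* members, packed in insertion order */
    unsigned *sparse;  /* sparse[v]: claimed position of v in dense[] */
    unsigned n;        /* number of members */
    unsigned cap;      /* values must lie in [0, cap) */
} sparse_set;

/* Returns 0 on success, -1 on allocation failure. */
int sparse_set_init(sparse_set *s, unsigned cap)
{
    assert(cap > 0);
    /* malloc rather than calloc on purpose: both arrays stay
     * uninitialized, so setup does not have to touch cap elements. */
    s->dense = malloc(cap * sizeof *s->dense);
    s->sparse = malloc(cap * sizeof *s->sparse);
    s->n = 0;
    s->cap = cap;
    if (!s->dense || !s->sparse) {
        free(s->dense);
        free(s->sparse);
        return -1;
    }
    return 0;
}

int sparse_set_contains(const sparse_set *s, unsigned v)
{
    if (v >= s->cap)
        return 0;
    unsigned i = s->sparse[v];  /* may be garbage if v was never added */
    /* Garbage is harmless: either i >= n, or dense[i] is an initialized
     * slot that names some other value, so the cross-check fails. */
    return i < s->n && s->dense[i] == v;
}

void sparse_set_add(sparse_set *s, unsigned v)
{
    assert(v < s->cap);
    if (sparse_set_contains(s, v))
        return;
    s->dense[s->n] = v;
    s->sparse[v] = s->n;
    s->n++;
}

/* Clearing is O(1): old entries simply stop passing the cross-check. */
void sparse_set_clear(sparse_set *s)
{
    s->n = 0;
}

void sparse_set_free(sparse_set *s)
{
    free(s->dense);
    free(s->sparse);
}
```

Tools that track uninitialized memory are likely to flag the read in
sparse_set_contains even though the result is well determined, which is the
sort of false alarm a blanket ban would trip over.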

I think good benchmark sets should be able to evolve to account for newly
discovered bugs, rather than remain frozen (which sounds like a recipe for
becoming obsolete sooner rather than later).

Alexander
