* H. Peter Anvin <h...@zytor.com> wrote:

> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> >
> > To correctly simulate the workload you'd have to:
> >
> >  - allocate a buffer larger than your L2 cache.
> >
> >  - to measure the effects of the prefetches you'd also have to randomize
> >    the individual buffer positions. See how 'perf bench numa' implements a
> >    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> >    Otherwise the CPU might learn your simplistic stream direction and the
> >    L2 cache might hw-prefetch your data, interfering with any explicit
> >    prefetches the code does. In many real-life usecases packet buffers are
> >    scattered.
> >
> > Also, it would be nice to see standard deviation noise numbers when two
> > averages are close to each other, to be able to tell whether differences
> > are statistically significant or not.
> >
>
> Seriously, though, how much does it matter?  All the above seems likely
> to do is to drown the signal by adding noise.

I think it matters a lot and I don't think it 'adds' noise - it measures
something else (cache-cold behavior - which is the common case for
first-time csum_partial() use for network packets), which was not measured
before, and which by its nature has different noise patterns.

I've done many cache-cold measurements myself and had no trouble achieving
statistically significant results and high precision.

> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?

Microbenchmarks can be totally blind to things like the ideal prefetch
window size, or to whether a prefetch should be done at all: some CPUs will
throw away prefetches if enough regular fetches arrive.

Also, 'naive' single-threaded algorithms can occasionally be better in the
cache-cold case, because a linear, predictable stream of memory accesses
might saturate the memory bus better than a somewhat random-looking,
interleaved web of accesses that might not harmonize with buffer depths.

I _think_ that, if correctly tuned, the parallel algorithm should be better
in the cache-cold case. I just don't know with what parameters (the
algorithm has at least one free parameter: the prefetch window size), and I
don't know how significant the effect is.

Also, more fundamentally, I absolutely detest doing no measurements or
measuring the wrong thing - IMHO there are too many 'blind' optimization
commits in the kernel with little to no observational data attached.

Thanks,

	Ingo
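
For illustration only, here is a minimal user-space sketch of the kind of
cache-cold measurement discussed above. It is not code from this thread and
not how 'perf bench numa' is implemented. It assumes a pool of packet-sized
slots that the caller sizes well above the L2 cache, visits the slots in a
shuffled order so a hardware stream prefetcher cannot simply follow the
walk, and reports mean and sample variance over several runs (the standard
deviation is the square root of the variance). All names (toy_csum,
shuffle_order, timed_pass, summarize, run_bench) and the 1500-byte packet
size are hypothetical, and toy_csum is a simple stand-in, not
csum_partial().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PKT_SIZE 1500u	/* assumed packet size in bytes */

struct stats { double mean; double var; };

/* Hypothetical stand-in for the routine under test; NOT csum_partial(). */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

/* Fisher-Yates shuffle: a random visiting order over n slots.
 * rand() is only adequate for a sketch (RAND_MAX may be small). */
static void shuffle_order(size_t *order, size_t n, unsigned int seed)
{
	srand(seed);
	for (size_t i = 0; i < n; i++)
		order[i] = i;
	for (size_t i = n; i > 1; i--) {
		size_t j = (size_t)rand() % i;
		size_t tmp = order[i - 1];

		order[i - 1] = order[j];
		order[j] = tmp;
	}
}

/* One timed pass: checksum every slot once, in the shuffled order. */
static double timed_pass(const uint8_t *pool, const size_t *order,
			 size_t nslots, volatile uint32_t *sink)
{
	clock_t t0 = clock();

	for (size_t i = 0; i < nslots; i++)
		*sink += toy_csum(pool + order[i] * PKT_SIZE, PKT_SIZE);
	return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Mean and sample variance (n - 1 denominator) of n >= 2 samples. */
static struct stats summarize(const double *x, size_t n)
{
	struct stats s = { 0.0, 0.0 };

	assert(n >= 2);
	for (size_t i = 0; i < n; i++)
		s.mean += x[i];
	s.mean /= (double)n;
	for (size_t i = 0; i < n; i++)
		s.var += (x[i] - s.mean) * (x[i] - s.mean);
	s.var /= (double)(n - 1);
	return s;
}

/* Run several passes over a pool of pool_bytes, reshuffling each time. */
static struct stats run_bench(size_t pool_bytes, size_t runs)
{
	size_t nslots = pool_bytes / PKT_SIZE;
	uint8_t *pool = malloc(nslots * PKT_SIZE);
	size_t *order = malloc(nslots * sizeof(*order));
	double *secs = malloc(runs * sizeof(*secs));
	volatile uint32_t sink = 0;	/* keeps the checksums from being optimized out */
	struct stats s;

	assert(nslots > 0 && pool && order && secs);
	memset(pool, 0x5a, nslots * PKT_SIZE);	/* touch pages before timing */
	for (size_t r = 0; r < runs; r++) {
		shuffle_order(order, nslots, (unsigned int)r + 1);
		secs[r] = timed_pass(pool, order, nslots, &sink);
	}
	s = summarize(secs, runs);
	free(secs);
	free(order);
	free(pool);
	return s;
}
```

A caller would pick pool_bytes several times larger than the L2 size of the
machine under test, e.g. run_bench(64u << 20, 10), and compare two variants
only when the difference in means is large relative to the spread. Within a
slot the access pattern is still linear, so this only randomizes the
positions of the individual buffers, not the accesses inside each one.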