On 2/16/24 23:49, Alexander Monakov wrote:

On Fri, 16 Feb 2024, Richard Henderson wrote:

Benchmark each acceleration function vs an aligned buffer of zeros.

Signed-off-by: Richard Henderson <richard.hender...@linaro.org>
---
+
+static void test(const void *opaque)
+{
+    size_t len = 64 * KiB;

This exceeds L1 cache capacity, so the performance ceiling of L2 cache
throughput is easier to hit with a suboptimal implementation. It also
seems to vastly exceed typical buffer sizes in QEMU.

When preparing the patch we mostly tested at 8 KiB. The size decides
whether the branch exiting the loop becomes perfectly predictable in
the microbenchmark, e.g. at 128 bytes per iteration it exits on the
63rd iteration, which Intel predictors cannot track, so we get
one mispredict per call.

(so perhaps smaller sizes like 2 or 4 KiB are better)
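
For reference, the iteration counts behind this argument can be worked out with a few lines of C. This is only a back-of-the-envelope sketch: `loop_iterations` is a hypothetical helper, and it assumes the entire buffer is consumed by a loop that handles 128 bytes per iteration, with no separate head or tail handling. The real implementation may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of full-stride loop iterations needed to scan len bytes.
 * Assumption: no head/tail handling outside the main loop.
 */
static size_t loop_iterations(size_t len, size_t stride)
{
    assert(stride > 0);
    return len / stride;
}
```

Under that assumption, `loop_iterations(8 * 1024, 128)` is 64, while 2 KiB and 4 KiB give 16 and 32 iterations, and 64 KiB gives 512.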

Fair.  I've adjusted to loop over 1, 4, 16, 64 KiB.

# Start of bufferiszero tests
# buffer_is_zero #0: 1KB 49227.29 MB/sec
# buffer_is_zero #0: 4KB 137461.28 MB/sec
# buffer_is_zero #0: 16KB 224220.41 MB/sec
# buffer_is_zero #0: 64KB 142461.00 MB/sec
# buffer_is_zero #1: 1KB 45423.59 MB/sec
# buffer_is_zero #1: 4KB 91409.69 MB/sec
# buffer_is_zero #1: 16KB 123819.94 MB/sec
# buffer_is_zero #1: 64KB 71173.75 MB/sec
# buffer_is_zero #2: 1KB 35465.03 MB/sec
# buffer_is_zero #2: 4KB 56110.46 MB/sec
# buffer_is_zero #2: 16KB 68852.28 MB/sec
# buffer_is_zero #2: 64KB 39043.80 MB/sec
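
A per-size throughput loop of this kind could look roughly like the sketch below. It is not the test code from the patch: `naive_is_zero`, `bench_mb_per_sec` and `run_sizes` are hypothetical names, the scan is a plain byte loop standing in for the accelerated functions, and timing uses standard C `clock()`, whose per-iteration overhead will skew results for the smallest sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the function under test; not QEMU's code. */
static bool naive_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    unsigned char acc = 0;

    for (size_t i = 0; i < len; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

/*
 * Repeatedly scan a zeroed buffer of len bytes for at least min_seconds
 * of CPU time and return the throughput in MB/sec (0 if allocation fails).
 */
static double bench_mb_per_sec(size_t len, double min_seconds)
{
    /* Volatile sink discourages the compiler from discarding the calls. */
    static volatile bool sink;
    void *buf;
    double bytes = 0.0, elapsed;
    clock_t start;

    assert(min_seconds > 0.0);
    buf = calloc(1, len);
    if (!buf) {
        return 0.0;
    }

    start = clock();
    do {
        sink = naive_is_zero(buf, len);
        bytes += (double)len;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < min_seconds);

    free(buf);
    return bytes / elapsed / (1024.0 * 1024.0);
}

/* Benchmark 1, 4, 16 and 64 KiB; returns the number of sizes measured. */
static int run_sizes(double min_seconds)
{
    int count = 0;

    for (size_t len = 1024; len <= 64 * 1024; len *= 4) {
        printf("# naive_is_zero: %zuKB %.2f MB/sec\n",
               len / 1024, bench_mb_per_sec(len, min_seconds));
        count++;
    }
    return count;
}
```

Calling `run_sizes(0.5)` prints one line per buffer size. Running each size for a fixed time rather than a fixed repetition count keeps the measurement window comparable across sizes.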


r~
