On 18/07/2025 17.05, Matt Fleming wrote:
From: Matt Fleming <mflem...@cloudflare.com>

Add benchmarks for the standard set of operations: lookup, update,
delete. Also, include a benchmark for trie_free() which is known to have
terrible performance for maps with many entries.

Benchmarks operate on tries without gaps in the key range, i.e. each
test begins with a trie with valid keys in the range [0, nr_entries).
This is intended to cause maximum branching when traversing the trie.

All measurements are recorded inside the kernel to remove syscall
overhead.

Most benchmarks run an XDP program to generate stats but free needs to
collect latencies using fentry/fexit on map_free_deferred() because it's
not possible to use fentry directly on lpm_trie.c since commit
c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry
bpf programs") and there's no way to create/destroy a map from within an
XDP program.

Here is example output from an AMD EPYC 9684X 96-Core machine for each
of the benchmarks using a trie with 10K entries and a 32-bit prefix
length, e.g.

   $ ./bench lpm-trie-$op \
        --prefix_len=32 \
        --producers=1 \
        --nr_entries=10000

   lookup: throughput    7.423 ± 0.023 M ops/s (  7.423M ops/prod), latency  134.710 ns/op
   update: throughput    2.643 ± 0.015 M ops/s (  2.643M ops/prod), latency  378.310 ns/op
   delete: throughput    0.712 ± 0.008 M ops/s (  0.712M ops/prod), latency 1405.152 ns/op
     free: throughput    0.574 ± 0.003 K ops/s (  0.574K ops/prod), latency    1.743 ms/op

Signed-off-by: Matt Fleming <mflem...@cloudflare.com>
---
  tools/testing/selftests/bpf/Makefile          |   2 +
  tools/testing/selftests/bpf/bench.c           |  10 +
  tools/testing/selftests/bpf/bench.h           |   1 +
  .../selftests/bpf/benchs/bench_lpm_trie_map.c | 345 ++++++++++++++++++
  .../selftests/bpf/progs/lpm_trie_bench.c      | 175 +++++++++
  .../selftests/bpf/progs/lpm_trie_map.c        |  19 +
  6 files changed, 552 insertions(+)
  create mode 100644 tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
  create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_bench.c
  create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_map.c


I've already tested and reviewed this benchmark (and earlier versions of it) during internal development. Thanks to Matt for working on this.

Tested-by: Jesper Dangaard Brouer <h...@kernel.org>

You can add my Reviewed-by once we resolve the comment below.

Reviewed-by: Jesper Dangaard Brouer <h...@kernel.org>


[...]
diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
new file mode 100644
index 000000000000..c335718cc240
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
@@ -0,0 +1,175 @@
[...]
+
+static __always_inline void atomic_inc(long *cnt)
+{
+       __atomic_add_fetch(cnt, 1, __ATOMIC_SEQ_CST);
+}
+
+static __always_inline long atomic_swap(long *cnt, long val)
+{
+       return __atomic_exchange_n(cnt, val, __ATOMIC_SEQ_CST);
+}

For userspace code we have similar defines in bench.h, except those use
__ATOMIC_RELAXED while these use __ATOMIC_SEQ_CST. Which memory order is
the correct one to use here?

For the BPF kernel side, do the selftests have another header file that
defines these `atomic_inc` and `atomic_swap` helpers?

--Jesper

