From: Gyutae Bae <[email protected]>

This series adds an atomic compare-and-delete primitive to BPF hash
maps. It is motivated by a TOCTOU race in Cilium's conntrack GC [1]:
the batched GC snapshots CT entries, decides which have expired, then
deletes them by key in a later syscall. Between snapshot and delete the
datapath can refresh the same entry, so a live entry gets deleted. A
userspace re-check before the delete can't close the window, because
lookup and delete are separate, individually bucket-locked calls.
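
The interleaving described above can be replayed in a small
single-threaded userspace model. This is only a sketch: struct ct_entry,
the helper names and the time units are hypothetical and do not reflect
Cilium's or the kernel's actual data layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-entry stand-in for a conntrack map (illustration only). */
struct ct_entry {
	bool present;
	uint64_t expires;	/* absolute expiry time, arbitrary units */
};

/* GC step 1: snapshot the entry and decide whether it has expired. */
static bool gc_snapshot_expired(const struct ct_entry *e, uint64_t now)
{
	return e->present && e->expires <= now;
}

/* Datapath: a packet on the flow pushes the expiry forward. */
static void datapath_refresh(struct ct_entry *e, uint64_t now, uint64_t ttl)
{
	if (e->present)
		e->expires = now + ttl;
}

/* GC step 2: unconditional delete by key, issued in a later syscall. */
static void gc_delete_by_key(struct ct_entry *e)
{
	e->present = false;
}

/*
 * Replays snapshot -> refresh -> delete in that order and reports whether
 * the entry was live (unexpired) at the moment the GC deleted it.
 */
static bool replay_race(void)
{
	struct ct_entry e = { .present = true, .expires = 100 };
	uint64_t now = 100;

	bool doomed = gc_snapshot_expired(&e, now);	/* decision on stale data */
	datapath_refresh(&e, now, 60);			/* entry is live again */
	bool live_at_delete = e.present && e.expires > now;

	if (doomed)
		gc_delete_by_key(&e);
	assert(!doomed || !e.present);	/* the delete always goes through */

	return doomed && live_at_delete;
}
```

In this model replay_race() reports that a refreshed entry was removed,
because the delete step never looks at the value it is deleting.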

BPF_F_COMPARE lets userspace delete a key only if a chosen value region
is unchanged, with the compare and the delete done atomically under the
hash bucket lock:

    attr.flags |= BPF_F_COMPARE;
    attr.compare = <expected>;
    attr.compare_offset = <off>;
    attr.compare_size = <len>;

Return values: value mismatch -> -EBUSY, key absent -> -ENOENT,
unsupported map type -> -EOPNOTSUPP. Setting any compare* field without
the flag is rejected (-EINVAL) so a dropped flag can't silently become
an unconditional delete. Maps whose value carries BTF-managed fields
(spin_lock/timer/kptr/...) are rejected (-EOPNOTSUPP) since those bytes
are sanitised on lookup, so userspace never sees their raw contents.
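
The return-value rules above can be summarised as a userspace model. This
is a sketch, not the patch: MODEL_F_COMPARE is a placeholder bit rather
than the UAPI value, struct model_elem / struct model_attr are
hypothetical, and both the order of the checks and the -EINVAL bounds
check are assumptions that the text above does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_F_COMPARE		(1U << 0)	/* placeholder, NOT the UAPI value */
#define MODEL_VALUE_SIZE	16

/* Hypothetical single map slot. */
struct model_elem {
	bool present;
	bool has_special_fields;	/* value carries spin_lock/timer/kptr/... */
	uint8_t value[MODEL_VALUE_SIZE];
};

/* Hypothetical mirror of the attr fields named in the cover letter. */
struct model_attr {
	uint32_t flags;
	const void *compare;
	uint32_t compare_offset;
	uint32_t compare_size;
};

static int model_delete_elem(struct model_elem *e, const struct model_attr *a)
{
	assert(e && a);

	bool has_cmp_fields = a->compare || a->compare_offset || a->compare_size;

	if (!(a->flags & MODEL_F_COMPARE)) {
		/* A dropped flag must not turn into an unconditional delete. */
		if (has_cmp_fields)
			return -EINVAL;
		if (!e->present)
			return -ENOENT;
		e->present = false;
		return 0;
	}

	if (e->has_special_fields)
		return -EOPNOTSUPP;

	/* Assumption: an empty or out-of-range compare region is invalid. */
	if (!a->compare || a->compare_size == 0 ||
	    a->compare_offset > MODEL_VALUE_SIZE ||
	    a->compare_size > MODEL_VALUE_SIZE - a->compare_offset)
		return -EINVAL;

	/* In the kernel, everything below would run under the bucket lock. */
	if (!e->present)
		return -ENOENT;
	if (memcmp(e->value + a->compare_offset, a->compare, a->compare_size))
		return -EBUSY;
	e->present = false;
	return 0;
}
```

A caller would treat -EBUSY as "the entry changed since the snapshot,
skip it" and -ENOENT as "already gone", neither of which is a hard error
for a GC pass.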

Atomicity boundary (please scrutinise): the compare is atomic vs every
bucket-lock holder, but NOT vs a BPF program writing the value in place
via the pointer from bpf_map_lookup_elem() (no bucket lock). The flag
therefore shrinks the race window from the whole GC batch to one
bucket-locked critical section rather than eliminating it; full closure
wants the compared region treated as a synchronization variable (e.g. a
monotonic revision). The selftest models this.
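
One way to picture the "monotonic revision" convention is the following
single-threaded model, in which every writer bumps a revision field and
the GC compares only that field. It is a sketch under assumptions:
struct ct_value, its layout and all helper names are hypothetical, it is
not the selftest from this series, and it does not model the concurrent
in-place write that the paragraph above warns about.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical map value: payload plus a revision bumped by every writer. */
struct ct_value {
	uint64_t expires;
	uint64_t revision;
};

struct ct_slot {
	bool present;
	struct ct_value v;
};

/* Datapath refresh: update the payload and advance the revision. */
static void datapath_refresh(struct ct_slot *s, uint64_t new_expires)
{
	s->v.expires = new_expires;
	s->v.revision++;
}

/* Model of compare-and-delete over bytes [off, off + len) of the value. */
static int cmp_delete(struct ct_slot *s, const void *expected,
		      size_t off, size_t len)
{
	assert(off <= sizeof(s->v) && len <= sizeof(s->v) - off);

	if (!s->present)
		return -ENOENT;
	if (memcmp((const char *)&s->v + off, expected, len))
		return -EBUSY;
	s->present = false;
	return 0;
}

/* GC: delete only if the revision still matches the snapshot. */
static int gc_delete_if_unchanged(struct ct_slot *s,
				  const struct ct_value *snap)
{
	return cmp_delete(s, &snap->revision,
			  offsetof(struct ct_value, revision),
			  sizeof(snap->revision));
}
```

Comparing the revision rather than the payload means a refresh that
happens to write the same expiry value is still detected, since the
revision only ever moves forward.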

Scope of this RFC: per-element compare-and-delete on BPF_MAP_TYPE_HASH
only. Deferred (will follow once the approach is agreed): batch delete +
its attr fields, a libbpf wrapper, LRU-hash and other map types, a
compare-and-swap *update*.

Open questions:
  - flag name: BPF_F_COMPARE vs something else?
  - mismatch errno: -EBUSY vs -EAGAIN?
  - new ->map_delete_elem_cmp() op vs extending ->map_delete_elem?

[1] https://github.com/cilium/cilium/issues/46298

Gyutae Bae (3):
  bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI
  bpf: implement compare-and-delete (BPF_F_COMPARE) for
    BPF_MAP_TYPE_HASH
  selftests/bpf: test BPF_F_COMPARE compare-and-delete

 include/linux/bpf.h                           |   2 +
 include/uapi/linux/bpf.h                      |   6 +-
 kernel/bpf/hashtab.c                          |  39 +++++++
 kernel/bpf/syscall.c                          |  54 ++++++++-
 tools/include/uapi/linux/bpf.h                |   6 +-
 .../selftests/bpf/prog_tests/map_cmp_delete.c | 106 ++++++++++++++++++
 6 files changed, 208 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c


base-commit: a975094bf98ca97be9146f9d3b5681a6f9cf5ce3
-- 
2.39.5 (Apple Git-154)

