From: Gyutae Bae <[email protected]>
This series adds an atomic compare-and-delete primitive to BPF hash
maps, motivated by a TOCTOU race in Cilium's conntrack GC [1]: the
batched GC snapshots CT entries, decides which expired, then deletes
them by key in a later syscall; between snapshot and delete the
datapath can refresh the same entry, so a live entry is deleted. A
userspace re-check before delete can't close it (lookup and delete are
separate, individually bucket-locked calls).
BPF_F_COMPARE lets userspace delete a key only if a chosen value region
is unchanged, with the compare and the delete done atomically under the
hash bucket lock:

    attr.flags |= BPF_F_COMPARE;
    attr.compare = <expected>;
    attr.compare_offset = <off>;
    attr.compare_size = <len>;

Errors: mismatch -> -EBUSY, key absent -> -ENOENT, unsupported map type
-> -EOPNOTSUPP. Setting any compare* field without the flag is rejected
(-EINVAL), so a dropped flag can't silently become an unconditional
delete. Maps whose value carries BTF-managed fields
(spin_lock/timer/kptr/...) are also rejected (-EOPNOTSUPP), since those
bytes are sanitised on lookup.
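
For reviewers who want the intended semantics in one place, here is a
minimal userspace model of the compare-then-delete step. It is a sketch
only, not the hashtab.c code from this series: struct model_bucket,
model_cmp_delete() and MODEL_VALUE_SIZE are hypothetical names, the
bucket holds a single element, an atomic_flag spinlock stands in for the
bucket lock, and returning -EINVAL for an out-of-range region is an
assumption rather than something the series specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VALUE_SIZE 16	/* hypothetical value size */

/* Hypothetical one-element "bucket"; the flag stands in for the bucket lock. */
struct model_bucket {
	atomic_flag lock;
	bool present;
	uint32_t key;
	uint8_t value[MODEL_VALUE_SIZE];
};

/*
 * Delete `key` only if value[off, off + len) equals `expected`.
 * The compare and the delete happen inside the same critical section.
 */
static int model_cmp_delete(struct model_bucket *b, uint32_t key,
			    const void *expected, size_t off, size_t len)
{
	int ret = 0;

	/* Assumption: an out-of-range compare region is rejected up front. */
	if (off > MODEL_VALUE_SIZE || len > MODEL_VALUE_SIZE - off)
		return -EINVAL;

	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;	/* spin until the "bucket lock" is ours */

	if (!b->present || b->key != key)
		ret = -ENOENT;		/* key absent */
	else if (memcmp(b->value + off, expected, len) != 0)
		ret = -EBUSY;		/* region changed since the snapshot */
	else
		b->present = false;	/* region unchanged: delete */

	atomic_flag_clear_explicit(&b->lock, memory_order_release);
	return ret;
}
```

A caller that gets -EBUSY keeps the entry and re-evaluates it on the
next GC pass; -ENOENT means something else already removed it.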
Atomicity boundary (please scrutinise): the compare is atomic vs every
bucket-lock holder, but NOT vs a BPF program writing the value in place
via the pointer from bpf_map_lookup_elem(), which takes no bucket lock.
The primitive therefore narrows the race window from the whole GC batch
to one bucket-locked critical section rather than eliminating it; full
closure requires treating the compared region as a synchronization
variable (e.g. a monotonic revision). The selftest models this.
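
To make the "monotonic revision" idea concrete, the sketch below models
the GC flow in plain userspace C. It is not the selftest from this
series: the value layout (struct model_ct_value) and the helpers
model_refresh(), model_snapshot() and model_gc_delete() are hypothetical,
and an atomic_flag spinlock stands in for the bucket lock. The refresh
path deliberately takes no lock, mirroring an in-place write from a BPF
program.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical conntrack-like value; the layout is an assumption. */
struct model_ct_value {
	_Atomic uint64_t revision;	/* bumped on every in-place refresh */
	_Atomic uint64_t expires;
};

struct model_ct_entry {
	atomic_flag bucket_lock;	/* stands in for the hash bucket lock */
	bool present;
	struct model_ct_value val;
};

/* Datapath side: in-place refresh that takes NO bucket lock. */
static void model_refresh(struct model_ct_entry *e, uint64_t new_expires)
{
	atomic_store_explicit(&e->val.expires, new_expires,
			      memory_order_relaxed);
	/* Release: whoever observes the bump also observes the new expiry. */
	atomic_fetch_add_explicit(&e->val.revision, 1, memory_order_release);
}

/* GC step 1: record the revision the expiry decision was based on. */
static uint64_t model_snapshot(struct model_ct_entry *e)
{
	return atomic_load_explicit(&e->val.revision, memory_order_acquire);
}

/* GC step 2: delete only if the revision still matches the snapshot. */
static int model_gc_delete(struct model_ct_entry *e, uint64_t snap_rev)
{
	int ret = 0;

	while (atomic_flag_test_and_set_explicit(&e->bucket_lock,
						 memory_order_acquire))
		;	/* spin until the "bucket lock" is ours */

	if (!e->present)
		ret = -ENOENT;
	else if (atomic_load_explicit(&e->val.revision,
				      memory_order_acquire) != snap_rev)
		ret = -EBUSY;		/* refreshed since the snapshot */
	else
		e->present = false;

	atomic_flag_clear_explicit(&e->bucket_lock, memory_order_release);
	return ret;
}
```

In this model any refresh that completes between model_snapshot() and
the compare turns the delete into -EBUSY. A refresh that lands after the
compare but before the delete is still not excluded, which is the
boundary described above.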
Scope of this RFC: per-element compare-and-delete on BPF_MAP_TYPE_HASH
only. Deferred (will follow once the approach is agreed): batch delete +
its attr fields, a libbpf wrapper, LRU-hash and other map types, a
compare-and-swap *update*.
Open questions:
- flag name: BPF_F_COMPARE vs something else?
- mismatch errno: -EBUSY vs -EAGAIN?
- new ->map_delete_elem_cmp() op vs extending ->map_delete_elem?
[1] https://github.com/cilium/cilium/issues/46298
Gyutae Bae (3):
  bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI
  bpf: implement compare-and-delete (BPF_F_COMPARE) for
    BPF_MAP_TYPE_HASH
  selftests/bpf: test BPF_F_COMPARE compare-and-delete

 include/linux/bpf.h                           |   2 +
 include/uapi/linux/bpf.h                      |   6 +-
 kernel/bpf/hashtab.c                          |  39 +++++++
 kernel/bpf/syscall.c                          |  54 ++++++++-
 tools/include/uapi/linux/bpf.h                |   6 +-
 .../selftests/bpf/prog_tests/map_cmp_delete.c | 106 ++++++++++++++++++
 6 files changed, 208 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c

base-commit: a975094bf98ca97be9146f9d3b5681a6f9cf5ce3
--
2.39.5 (Apple Git-154)