On 2025-12-18 03:36, Boqun Feng wrote:
On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
[...]
+static inline
+void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
+{
+        struct hazptr_slot *slot = NULL;
+        void *addr, *addr2;
+
+        /*
+         * Load @addr_p to know which address should be protected.
+         */
+        addr = READ_ONCE(*addr_p);
+        for (;;) {
+                if (!addr)
+                        return NULL;
+                guard(preempt)();
+                if (likely(!hazptr_slot_is_backup(ctx, slot))) {
+                        slot = hazptr_get_free_percpu_slot();
I need to continue to share my concerns about this "allocating slot
while protecting" pattern. Realistically, we will walk over a few of the
per-CPU hazard pointer slots *every time* here, instead of directly
using a pre-allocated hazard pointer slot.
No, that's not the expected fast-path with CONFIG_PREEMPT_HAZPTR=y
(introduced in patch 4/4).
With PREEMPT_HAZPTR, using more than one hazard pointer per CPU will
only happen if there are nested hazard pointer users, which can happen
due to:
- Holding a hazard pointer across function calls, where another hazard
pointer is used.
- Using hazard pointers from interrupt handlers (note: my current code
only does preempt disable, not irq disable; this is something I'd need
to change if we wish to acquire/release hazard pointers from interrupt
handlers). But even that should be a rare event.
So the fast-path has an initial state where there are no hazard pointers
in use on the CPU, which means hazptr_acquire() finds its empty slot at
index 0.
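
For concreteness, here is a minimal sketch of that fast-path lookup
(hypothetical slot layout, slot count and helper body; the real series
may differ). With PREEMPT_HAZPTR and no hazard pointer active on this
CPU, the scan returns slot 0 on the first iteration:

#include <linux/percpu.h>

#define HAZPTR_PERCPU_SLOTS	8	/* assumed slot count */

struct hazptr_slot {
	void *addr;	/* protected address, NULL when the slot is free */
};

struct hazptr_percpu {
	struct hazptr_slot slots[HAZPTR_PERCPU_SLOTS];
};

static DEFINE_PER_CPU(struct hazptr_percpu, hazptr_percpu);

/* Caller holds guard(preempt)(), as in hazptr_acquire() above. */
static struct hazptr_slot *hazptr_get_free_percpu_slot(void)
{
	struct hazptr_percpu *hp = this_cpu_ptr(&hazptr_percpu);
	int i;

	for (i = 0; i < HAZPTR_PERCPU_SLOTS; i++) {
		if (!READ_ONCE(hp->slots[i].addr))
			return &hp->slots[i];	/* common case: i == 0 */
	}
	return NULL;	/* all slots busy: fall back to a backup slot */
}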
Could you utilize this[1] to see a
comparison of the reader-side performance against RCU/SRCU?
Good point! Let's see.
On an AMD 2x EPYC 9654 96-Core Processor with 192 cores,
hyperthreading disabled,
CONFIG_PREEMPT=y,
CONFIG_PREEMPT_RCU=y,
CONFIG_PREEMPT_HAZPTR=y.
scale_type               ns
---------------------------
hazptr-smp-mb          13.1   <- this implementation
hazptr-barrier         11.5   <- replace smp_mb() on acquire with barrier(),
                                 requires IPIs on synchronize.
hazptr-smp-mb-hlist    12.7   <- replace per-task hp context and per-cpu
                                 overflow lists by hlist.
rcu                    17.0
srcu                   20.0
srcu-fast               1.5
rcu-tasks               0.0
rcu-trace               1.7
refcnt               1148.0
rwlock               1190.0
rwsem                4199.3
lock                41070.6
lock-irq            46176.3
acqrel                  1.1
So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
appear to beat hazptr read-side performance.
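
For reproducibility: these are refscale (kernel/rcu/refscale.c) numbers,
in nanoseconds per reader-side operation. The rcu, srcu, refcnt, rwlock,
rwsem, lock, lock-irq and acqrel scale types are in the stock module; I
am assuming the hazptr-* types are added on top of this series. A single
scale type can be run along these lines, with results reported to the
kernel log:

	modprobe refscale scale_type=srcu nreaders=192 loops=10000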
[...]
+/*
+ * Perform piecewise iteration on the overflow list, waiting until "addr"
+ * is not present. The raw spinlock is released and re-taken between each
+ * list item and busy-loop iteration. The overflow list generation is
+ * checked each time the lock is taken, to validate that the list has not
+ * changed before resuming iteration or busy waiting. If the generation
+ * has changed, retry the entire list traversal.
+ */
+static
+void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list,
+                                      void *addr)
+{
+        struct hazptr_backup_slot *backup_slot;
+        uint64_t snapshot_gen;
+
+        raw_spin_lock(&overflow_list->lock);
+retry:
+        snapshot_gen = overflow_list->gen;
+        list_for_each_entry(backup_slot, &overflow_list->head, node) {
+                /* Busy-wait if node is found. */
+                while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
+                        raw_spin_unlock(&overflow_list->lock);
+                        cpu_relax();
I think we should prioritize the scan thread solution [2] over
busy-waiting in hazard pointer updaters, because when we have multiple
hazard pointer uses we would want to consolidate the scans on the
updater side.
I agree that batching scans with a worker thread is a logical next step.
If so, the whole ->gen can be avoided.
How would it allow removing the generation trick without causing long
raw spinlock latencies?
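
To make sure we are talking about the same design, here is roughly how I
picture the batched scan (entirely hypothetical names and structure; not
the code from [2]): updaters enqueue the address they are waiting on and
sleep, and a single kthread amortizes the slot scans over the whole
batch. My question above is what replaces the piecewise ->gen traversal
of the overflow list in this scheme:

#include <linux/completion.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct hazptr_sync_req {
	struct list_head	node;
	void			*addr;
	struct completion	done;
};

static LIST_HEAD(hazptr_sync_reqs);
static DEFINE_SPINLOCK(hazptr_sync_lock);
static DECLARE_WAIT_QUEUE_HEAD(hazptr_scan_wq);

/* Updater side: queue the retired address and wait for the scan thread. */
void hazptr_synchronize(void *addr)
{
	struct hazptr_sync_req req = { .addr = addr };

	init_completion(&req.done);
	spin_lock(&hazptr_sync_lock);
	list_add_tail(&req.node, &hazptr_sync_reqs);
	spin_unlock(&hazptr_sync_lock);
	wake_up(&hazptr_scan_wq);
	wait_for_completion(&req.done);
}

static int hazptr_scan_thread(void *unused)
{
	struct hazptr_sync_req *req, *tmp;
	LIST_HEAD(batch);

	for (;;) {
		wait_event(hazptr_scan_wq, !list_empty(&hazptr_sync_reqs));

		spin_lock(&hazptr_sync_lock);
		list_splice_init(&hazptr_sync_reqs, &batch);
		spin_unlock(&hazptr_sync_lock);

		while (!list_empty(&batch)) {
			/*
			 * hazptr_addr_in_use() (assumed helper) stands in
			 * for the slot scan; a real implementation would
			 * collect all active addresses in one pass and
			 * test the whole batch against them.
			 */
			list_for_each_entry_safe(req, tmp, &batch, node) {
				if (!hazptr_addr_in_use(req->addr)) {
					list_del(&req->node);
					complete(&req->done);
				}
			}
			if (!list_empty(&batch))
				schedule_timeout_uninterruptible(1);
		}
	}
	return 0;
}

Note that the spinlock here only protects the request queue, not the
slot scan itself, which is why I suspect something like the ->gen retry
is still needed for the overflow list.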
However, this ->gen idea does seem to resolve another issue for me. I'm
trying to make shazptr critical sections preemptible by using a per-task
backup slot (if you recall, this is your idea from the hallway
discussions we had during LPC 2024),
I honestly did not remember. It's been a whole year! ;-)
and currently I could not make it
work because of the following sequence:
1. CPU 0 already has one pointer protected.
2. CPU 1 begins the updater scan, and it scans the list of preempted
hazard pointer readers, no reader.
3. CPU 0 does a context switch: it stores the current hazard pointer
value to the current task's ->hazard_slot (let's say the task is task
A), and adds it to the list of preempted hazard pointer readers.
4. CPU 0 clears its percpu hazptr_slots for the next task (B).
5. CPU 1 continues the updater scan, and it scans the percpu slot of
CPU 0, and finds no reader.
In this situation, the updater will miss a reader. But if we add a
generation snapshot at step 2 and a generation increment at step 3, I
think it'll work.
IMO, if we make this work, it's better than the current backup slot
mechanism, because we only need to acquire the lock if a context
switch happens.
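
For concreteness, a minimal sketch of the generation fix described above
(all helper names and the task_struct field are made up for
illustration; ordering is simplified to smp_mb()):

static atomic_long_t hazptr_preempt_gen = ATOMIC_LONG_INIT(0);

/* Steps 3-4: context switch away from a task with an active hazptr. */
static void hazptr_sched_out(struct task_struct *prev,
			     struct hazptr_slot *percpu_slot)
{
	/* Publish the reader on the preempted-readers list first. */
	WRITE_ONCE(prev->hazard_slot.addr, READ_ONCE(percpu_slot->addr));
	hazptr_add_preempted_reader(prev);	/* assumed helper */
	/*
	 * Bump the generation after publishing and before clearing the
	 * per-CPU slot, so a concurrent scan straddling the move retries.
	 * (A real implementation needs explicit ordering, e.g. smp_mb().)
	 */
	atomic_long_inc(&hazptr_preempt_gen);
	WRITE_ONCE(percpu_slot->addr, NULL);
}

/* Steps 2 and 5: updater-side scan. Returns true if @addr is protected. */
static bool hazptr_scan(void *addr)
{
	long snap;

retry:
	snap = atomic_long_read(&hazptr_preempt_gen);
	smp_mb();	/* Order the snapshot before the scans. */
	if (hazptr_scan_preempted_readers(addr))	/* assumed helper */
		return true;
	if (hazptr_scan_percpu_slots(addr))		/* assumed helper */
		return true;
	smp_mb();	/* Order the scans before the recheck. */
	if (atomic_long_read(&hazptr_preempt_gen) != snap)
		goto retry;	/* A reader moved between lists; rescan. */
	return false;
}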
With PREEMPT_HAZPTR we also only need to acquire the per-cpu overflow
list raw spinlock on context switch (preemption or blocking). The only
other case requiring it is nested hazptr usage (more than 8 active
hazard pointers) in a thread context combined with nested irqs.
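
To illustrate, a hedged sketch of that only other lock-taking path
(per-CPU variable and field names inferred from the quoted code above,
assumed rather than exact): enqueue a backup slot on the per-CPU
overflow list, bumping ->gen so that a concurrent piecewise traversal
in hazptr_synchronize_overflow_list() conservatively restarts.

static struct hazptr_slot *hazptr_get_backup_slot(struct hazptr_ctx *ctx)
{
	/* Assumed per-CPU overflow list. */
	struct overflow_list *ol = this_cpu_ptr(&hazptr_overflow_list);

	raw_spin_lock(&ol->lock);
	list_add(&ctx->backup_slot.node, &ol->head);
	ol->gen++;	/* Force concurrent traversals to retry. */
	raw_spin_unlock(&ol->lock);
	return &ctx->backup_slot.slot;
}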
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com