On 2026/2/2 23:09, Peter Zijlstra wrote:
On Mon, Feb 02, 2026 at 10:37:39PM +0800, Lance Yang wrote:
On 2026/2/2 21:37, Peter Zijlstra wrote:
On Mon, Feb 02, 2026 at 09:07:10PM +0800, Lance Yang wrote:
Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
unconditionally and not add overhead?
The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
(khugepaged) paths, regardless of whether table free uses RCU, IIUC.
In addition: We need the sync when we modify page tables (e.g. unshare,
collapse), not only when we free them. RCU can defer freeing but does
not prevent lockless walkers from seeing concurrent in-place
modifications, so we need the IPI to synchronize with those walkers
first.
Currently PT_RECLAIM=y has no IPI; are you saying that is broken? If
not, then why do we need this at all?
PT_RECLAIM=y does have IPI for unshare/collapse — those paths call
tlb_flush_unshared_tables() (for hugetlb unshare) and collapse_huge_page()
(in khugepaged collapse), which already send IPIs today (broadcast to all
CPUs via tlb_remove_table_sync_one()).
What PT_RECLAIM=y doesn't need IPI for is table freeing
(__tlb_remove_table_one() uses call_rcu() instead). But table modification
(unshare, collapse) still needs IPI to synchronize with lockless walkers,
regardless of PT_RECLAIM.
So PT_RECLAIM=y is not broken; it already has IPI where needed. This series
just makes those IPIs targeted instead of broadcast. Does that clarify?
Oh bah, reading is hard. I had missed they had more table_sync_one() calls,
rather than remove_table_one().
So you *can* replace table_sync_one() with rcu_sync(), that will provide
the same guarantees. It's just a 'little' bit slower on the update side,
but does not incur the read side cost.
Yep, we could replace the IPI with synchronize_rcu() on the sync side:
- Currently: TLB flush → send IPI → wait for walkers to finish
- With synchronize_rcu(): TLB flush → synchronize_rcu() → wait for
  grace period
Lockless walkers (e.g. GUP-fast) use local_irq_disable();
synchronize_rcu() also waits for regions with preemption/interrupts
disabled, so it should work, IIUC.
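To make the grace-period argument a bit more concrete, here is a tiny
single-threaded toy model in userspace C. It is only a sketch: every name
in it (toy_cpu, toy_gp, toy_walk_begin(), toy_gp_start(), ...) is made up
for illustration and none of it is kernel code. The in_walk flag stands in
for a CPU's IRQ-disabled state, which the kernel does not have to track
explicitly; the model only shows which walkers the update side waits for.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Per-CPU state of the toy model (hypothetical, not a kernel structure). */
struct toy_cpu {
	bool in_walk;		/* inside an IRQ-disabled lockless walk */
	unsigned long exits;	/* number of walks this CPU has finished */
};

/* Snapshot taken when the updater starts waiting. */
struct toy_gp {
	bool must_wait[NR_CPUS];
	unsigned long exits_at_start[NR_CPUS];
};

static struct toy_cpu cpus[NR_CPUS];

/* Read side: stands in for local_irq_disable() ... local_irq_enable(). */
static void toy_walk_begin(int cpu)
{
	cpus[cpu].in_walk = true;
}

static void toy_walk_end(int cpu)
{
	assert(cpus[cpu].in_walk);
	cpus[cpu].in_walk = false;
	cpus[cpu].exits++;
}

/* Update side: record which CPUs were mid-walk when the wait began. */
static void toy_gp_start(struct toy_gp *gp)
{
	for (int i = 0; i < NR_CPUS; i++) {
		gp->must_wait[i] = cpus[i].in_walk;
		gp->exits_at_start[i] = cpus[i].exits;
	}
}

/* The wait may end once every walk that predates it has finished. */
static bool toy_gp_done(const struct toy_gp *gp)
{
	for (int i = 0; i < NR_CPUS; i++) {
		if (gp->must_wait[i] &&
		    cpus[i].exits == gp->exits_at_start[i])
			return false;
	}
	return true;
}
```

In this model, starting a walk on CPU 1, calling toy_gp_start(), and then
starting a second walk on CPU 2 leaves toy_gp_done() false until CPU 1
calls toy_walk_end(). The later walk on CPU 2 is never waited for, because
it began after the wait started.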
And then, the trade-off would be:
- Read side: zero cost (no per-CPU tracking)
- Write side: wait for RCU grace period (potentially slower)
For collapse/unshare, that write-side latency might be acceptable :)
@David, what do you think?
I really think anything here needs to better explain the various
requirements. Because now everybody gets to pay the price for hugetlb
shared crud, while 'nobody' will actually use that.
Right. If we go with synchronize_rcu(), the read-side cost goes away ...
Thanks,
Lance