[To += Mathieu]

"Nysal Jan K.A." <ny...@linux.ibm.com> writes:
> From: "Nysal Jan K.A" <ny...@linux.ibm.com>
>
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and the load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
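A minimal standalone sketch of that effect (just an illustration, not the
kernel code; the struct, function names and flag value are placeholders):
because atomic_read() boils down to a volatile load, the compiler has to
keep the load even when the value only feeds a no-op:

        struct mm_struct { int membarrier_state; };

        /* stands in for sync_core_before_usermode() when the arch
         * doesn't select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE */
        static inline void sync_core_stub(void) { }

        void demo(struct mm_struct *mm, struct mm_struct *cur_mm)
        {
                if (cur_mm != mm)
                        return;
                /* volatile read models atomic_read(); it can't be DCE'd */
                if (*(volatile int *)&mm->membarrier_state & 1)
                        sync_core_stub();
        }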
I was wondering if this was caused by powerpc's arch_atomic_read() which
uses asm volatile.

But replacing arch_atomic_read() with READ_ONCE() makes no difference,
presumably because the compiler still can't see that the READ_ONCE() is
unnecessary (which is kind of by design).

> Here's a snippet of the code generated for finish_task_switch() on powerpc:
>
> 1b786c:  ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:  cmpdi   cr7,r26,0
> 1b78cc:  beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:  ld      r9,2312(r13)    # current
> 1b78d4:  ld      r9,1888(r9)     # current->mm
> 1b78d8:  cmpd    cr7,r26,r9
> 1b78dc:  beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:  hwsync
> 1b78e4:  cmplwi  cr7,r27,128
> .......
> 1b7a70:  lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:  b       1b78e0 <finish_task_switch+0xcc>
>
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.

So it was causing a noticeable performance blip? But isn't anymore?

> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter:
>
> GCC 7.5.0:
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                    old     new   delta
> finish_task_switch          884     852     -32
>
> GCC 12.2.1:
> -----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                    old     new   delta
> finish_task_switch.isra     852     820     -32

GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

> LLVM 17.0.6:
> ------------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                    old     new   delta
> rt_mutex_schedule           120     104     -16
> finish_task_switch          792     772     -20
>
> Signed-off-by: Nysal Jan K.A <ny...@linux.ibm.com>
> ---
>  include/linux/sched/mm.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 07bb8d4181d7..042e60ab853a 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -540,6 +540,8 @@ enum {
>
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &

The other option would be to have a completely separate stub, eg:

#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
	if (current->mm != mm)
		return;
	if (likely(!(atomic_read(&mm->membarrier_state) &
		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
		return;
	sync_core_before_usermode();
}
#else
static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
#endif

Not sure what folks prefer. In either case I think it's probably worth a
short comment explaining why it's worth the trouble (ie. that the
atomic_read() prevents the compiler from doing DCE).

cheers
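PS: For the IS_ENABLED() variant, something along these lines maybe
(untested, comment wording is only a suggestion):

static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
	/*
	 * The atomic_read() below cannot be dead-code-eliminated, so bail
	 * out early when sync_core_before_usermode() is a no-op to avoid
	 * the branches and the load of mm->membarrier_state.
	 */
	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
		return;
	if (current->mm != mm)
		return;
	if (likely(!(atomic_read(&mm->membarrier_state) &
		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
		return;
	sync_core_before_usermode();
}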