Yeah, awesome.

On Mon Nov 20, 2023 at 5:10 PM AEST, Michael Ellerman wrote:
> Hi Timothy,
>
> Great work debugging this. I think your fix is good, but I want to
> understand it a bit more to make sure I can explain why we haven't
> seen it outside of io-uring.

Analysis seems right to me, and this is probably the best minimal fix.
But I wonder if we should just use one path for saving/flushing/giving
up, i.e. just use giveup instead of save?

KVM looks odd too, and actually gets this wrong, in a way that's not
fixed by Timothy's patch, because it's just not restoring the userspace
registers at all. Fortunately QEMU isn't in the habit of using
non-volatile FP/VEC registers over a VCPU ioctl, but there's no reason
it couldn't, since GCC/LLVM can easily use them. KVM really wants to be
using giveup.
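
For illustration, userspace along these lines (a made-up sketch, not
actual QEMU code) is perfectly legal per the ABI, and would be silently
corrupted if KVM skips restoring user FP state on the way out of the
ioctl:

  /* Sketch only: the compiler is entitled to keep 'acc' in a
   * non-volatile FPR (fr14-fr31) across the call, so it must
   * survive the VCPU ioctl. */
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  double run_vcpu_accumulate(int vcpu_fd, double acc)
  {
          acc += 1.0;                 /* live FP value before entry */
          ioctl(vcpu_fd, KVM_RUN, 0); /* run the guest */
          return acc * 2.0;           /* relies on acc surviving */
  }
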
Thanks,
Nick

> If this can be triggered outside of io-uring then I have even more
> backporting in my future :}
>
> Typically save_fpu() is called from __giveup_fpu(), which saves the FP
> regs and also *turns off FP* in the task's MSR, meaning the kernel
> will reload the FP regs from the thread struct before letting the task
> use FP again. So in that case save_fpu() is free to clobber fr0,
> because the FP regs no longer hold live values for the task.
>
> There is another case though, which is the path via:
>
>   copy_process()
>     dup_task_struct()
>       arch_dup_task_struct()
>         flush_all_to_thread()
>           save_all()
>
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork():
> leaving the regs live means the parent process doesn't have to take a
> fault after the fork to get its FP regs back.
>
> That path does clobber fr0, but fr0 is volatile across a syscall, and
> the only way to reach copy_process() from userspace is via a syscall.
> So in normal usage fr0 being clobbered across a syscall shouldn't
> cause data corruption.
>
> Even if we handle a signal on the return from the fork() syscall, the
> worst that happens is that the task's thread struct holds the
> clobbered fr0, but the task doesn't care (because fr0 is volatile
> across the syscall anyway).
>
> That path is something like:
>
>   system_call_vectored_common()
>     system_call_exception()
>       sys_fork()
>         kernel_clone()
>           copy_process()
>             dup_task_struct()
>               arch_dup_task_struct()
>                 flush_all_to_thread()
>                   save_all()
>                     if (tsk->thread.regs->msr & MSR_FP)
>                       save_fpu()
>                       # does not clear MSR_FP from regs->msr
>     syscall_exit_prepare()
>       interrupt_exit_user_prepare_main()
>         do_notify_resume()
>           get_signal()
>             handle_rt_signal64()
>               prepare_setup_sigcontext()
>                 flush_fp_to_thread()
>                   if (tsk->thread.regs->msr & MSR_FP)
>                     giveup_fpu()
>                       __giveup_fpu()
>                         save_fpu()
>                         # clobbered fr0 is saved, but the task
>                         # considers it volatile across the syscall
>                         # anyway
>
> But we now have a new path, because io-uring can call copy_process()
> via create_io_thread() from the signal handling path. That's OK if the
> signal is handled as we return from a syscall, but it's not OK if the
> signal is handled due to some other interrupt.
>
> Which is:
>
>   interrupt_return_srr_user()
>     interrupt_exit_user_prepare()
>       interrupt_exit_user_prepare_main()
>         do_notify_resume()
>           get_signal()
>             task_work_run()
>               create_worker_cb()
>                 create_io_worker()
>                   copy_process()
>                     dup_task_struct()
>                       arch_dup_task_struct()
>                         flush_all_to_thread()
>                           save_all()
>                             if (tsk->thread.regs->msr & MSR_FP)
>                               save_fpu()
>                               # fr0 is clobbered and potentially
>                               # live in userspace
>
> So, tl;dr: I think the corruption is only an issue since io-uring
> started doing the clone via signal, which I think matches the observed
> timeline of this bug appearing.
>
> Gotta run home, will have a closer look at the actual patch later on.
>
> cheers
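
To make the volatility argument above concrete, this sort of thing (a
made-up userspace sketch, nothing from the actual report) is why the
fork() path has always been safe:

  /* fr0 is volatile (caller-saved) in the powerpc ABIs, so correct
   * codegen never expects it to survive a function call, let alone
   * the syscall underneath it. */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          double d = 2.5;         /* may transiently live in fr0 */

          if (fork() == 0)        /* save_all() may clobber fr0 here */
                  _exit(0);

          /* The compiler kept 'd' in a non-volatile register or on
           * the stack across the call, so the clobber is invisible. */
          printf("%f\n", d);
          return 0;
  }
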
>
> Timothy Pearson <tpear...@raptorengineering.com> writes:
> > During floating point and vector save to thread data, fr0/vs0 are
> > clobbered by the FPSCR/VSCR store routine. This leads to userspace
> > register corruption and application data corruption / crash under
> > the following rare conditions:
> >
> >  * A userspace thread is executing with VSX/FP mode enabled
> >  * The userspace thread is making active use of fr0 and/or vs0
> >  * An IPI is taken in kernel mode, forcing the userspace thread to
> >    reschedule
> >  * The userspace thread is interrupted by the IPI before accessing
> >    data it previously stored in fr0/vs0
> >  * The thread being switched in by the IPI has a pending signal
> >
> > If these exact criteria are met, then the following sequence happens:
> >
> >  * The existing thread FP storage is still valid before the IPI, due
> >    to a prior call to save_fpu() or store_fp_state(). Note that the
> >    current fr0/vs0 registers have been clobbered, so the FP/VSX state
> >    in registers is now invalid, pending a call to restore_fp() /
> >    restore_altivec().
> >  * IPI -- FP/VSX register state remains invalid
> >  * interrupt_exit_user_prepare_main() calls do_notify_resume(), due
> >    to the pending signal
> >  * do_notify_resume() eventually calls save_fpu() via giveup_fpu(),
> >    which merrily reads and saves the invalid FP/VSX state to thread
> >    local storage.
> >  * interrupt_exit_user_prepare_main() calls restore_math(), writing
> >    the invalid FP/VSX state back to registers.
> >  * Execution is released to userspace, and the application crashes
> >    or corrupts data.
> >
> > Without the pending signal, do_notify_resume() is never called, and
> > therefore the invalid register state doesn't matter, as it is
> > overwritten nearly immediately by interrupt_exit_user_prepare_main()
> > calling restore_math() before return to userspace.
> >
> > Restore fr0/vs0 after the FPSCR/VSCR store has completed, for both
> > the fp and altivec register save paths.
> >
> > Tested under QEMU in kvm mode, running on a Talos II workstation
> > with dual POWER9 DD2.2 CPUs.
> >
> > Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.javamail.zim...@raptorengineeringinc.com/
> > Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.javamail.zim...@raptorengineeringinc.com/
> > Tested-by: Timothy Pearson <tpear...@raptorengineering.com>
> > Tested-by: Jens Axboe <ax...@kernel.dk>
> > Signed-off-by: Timothy Pearson <tpear...@raptorengineering.com>
> > ---
> >  arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
> >  arch/powerpc/kernel/vector.S |  2 ++
> >  2 files changed, 15 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
> > index 6a9acfb690c9..2f8f3f93cbb6 100644
> > --- a/arch/powerpc/kernel/fpu.S
> > +++ b/arch/powerpc/kernel/fpu.S
> > @@ -23,6 +23,15 @@
> >  #include <asm/feature-fixups.h>
> >  
> >  #ifdef CONFIG_VSX
> > +#define __REST_1FPVSR(n,c,base)                                 \
> > +BEGIN_FTR_SECTION                                               \
> > +	b	2f;                                             \
> > +END_FTR_SECTION_IFSET(CPU_FTR_VSX);                             \
> > +	REST_FPR(n,base);                                       \
> > +	b	3f;                                             \
> > +2:	REST_VSR(n,c,base);                                     \
> > +3:
> > +
> >  #define __REST_32FPVSRS(n,c,base)                               \
> >  BEGIN_FTR_SECTION                                               \
> >  	b	2f;                                             \
> > @@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);         \
> >  2:	SAVE_32VSRS(n,c,base);                                  \
> >  3:
> >  #else
> > +#define __REST_1FPVSR(n,b,base)	REST_FPR(n, base)
> >  #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
> >  #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
> >  #endif
> > +#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
> >  #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
> >  #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
> >  
> > @@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
> >  	SAVE_32FPVSRS(0, R4, R3)
> >  	mffs	fr0
> >  	stfd	fr0,FPSTATE_FPSCR(r3)
> > +	REST_1FPVSR(0, R4, R3)
> >  	blr
> >  EXPORT_SYMBOL(store_fp_state)
> >  
> > @@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
> >  2:	SAVE_32FPVSRS(0, R4, R6)
> >  	mffs	fr0
> >  	stfd	fr0,FPSTATE_FPSCR(r6)
> > +	REST_1FPVSR(0, R4, R6)
> >  	blr
> > diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
> > index 4094e4c4c77a..80b3f6e476b6 100644
> > --- a/arch/powerpc/kernel/vector.S
> > +++ b/arch/powerpc/kernel/vector.S
> > @@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
> >  	mfvscr	v0
> >  	li	r4, VRSTATE_VSCR
> >  	stvx	v0, r4, r3
> > +	lvx	v0, 0, r3
> >  	blr
> >  EXPORT_SYMBOL(store_vr_state)
> >  
> > @@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
> >  	mfvscr	v0
> >  	li	r4,VRSTATE_VSCR
> >  	stvx	v0,r4,r7
> > +	lvx	v0,0,r7
> >  	blr
> >  
> >  #ifdef CONFIG_VSX
> > --
> > 2.39.2
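
For anyone who wants to poke at this, an untested toy detector along
these lines shows the failure mode (a sketch only; the live window here
is a single instruction wide, so in practice you need something like
the io_uring workloads from the original reports to make worker
creation land in it):

  #include <stdio.h>

  int main(void)
  {
          const double in = 1234.5678;
          double out;

          for (;;) {
                  /* Park a known value in fr0; an interrupt between
                   * the two fmr instructions, combined with the buggy
                   * save/restore, hands back the FPSCR image instead. */
                  asm volatile("fmr 0,%1\n\t"
                               "fmr %0,0"
                               : "=f"(out) : "f"(in) : "f0");
                  if (out != in) {
                          printf("fr0 corrupted: %f\n", out);
                          return 1;
                  }
          }
  }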