On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <m...@ellerman.id.au>
> > To: "Timothy Pearson" <tpear...@raptorengineering.com>, "linuxppc-dev" 
> > <linuxppc-dev@lists.ozlabs.org>
> > Cc: "Jens Axboe" <ax...@kernel.dk>
> > Sent: Tuesday, November 14, 2023 6:14:37 AM
> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>
> > Hi Timothy,
> > 
> > Thanks for debugging this, but I'm unclear why this is helping because
> > we should already have a full barrier (hwsync) on both the sending and
> > receiving side.
> > 
> > More below.
>
> I've spent another few days poking at this, and think I might finally have 
> something more solid in terms of what exactly is happening, but would like 
> some feedback on the concept / how best to fix the potential problem.
>
> As background, there are several worker threads both in userspace and in 
> kernel mode.  Crucially, the main MariaDB data processing thread (the one 
> that handles tasks like flushing dirty pages to disk) always runs on the same 
> core as the io_uring kernel thread that picks up I/O worker creation requests 
> and handles them via create_worker_cb().
>
> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring 
> currently sets up the new process with create_io_thread(), and immediately 
> uses an IPI to forcibly schedule the new process.  Because of the way the two 
> threads interact, the new process ends up grabbing the CPU from the running 
> MariaDB user thread; I've never seen it schedule on a different core.  If the 
> timing is right in this process, things get trampled on in userspace and the 
> database server either crashes or throws a corruption fault.
>
> Through extensive debugging, I've narrowed this down to invalid state in the 
> VSX registers on return to the MariaDB user thread from the new kernel 
> thread.  For some reason, it seems we don't restore FP state on return from 
> the PF_IO_WORKER thread, and something in the kernel was busy writing new 
> data to them.
>
> A direct example I was able to observe is as follows:
>
> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here 
> due to the IPI
> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the 
> value 0x820040000000000082004000
> xxswapd vs8,vs0        <-- vs8 now has the wrong value
> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
> stw     r9,116(r3)
> stxvd2x vs8,r3,r0
> ...
> CRASH

Nice find, that looks pretty conclusive.

> This is a very difficult race to hit, but MariaDB naturally does repetitive 
> operations with VSX registers so it does eventually fail.  I ended up with a 
> tight loop around glibc operations that use VSX to trigger the failure 
> reliably enough to even understand what was going on.
>
> As I am not as familiar with this part of the Linux kernel as with most other 
> areas, what is the expected save/restore path for the FP/VSX registers around 
> an IPI and associated forced thread switch?  If restore_math() is in the 
> return path, note that MSR_FP is set in regs->msr.

Context switching these FP/vec registers should be pretty robust in
general because it's not just io-uring that uses them. io-uring could
be using some uncommon kernel code that uses the registers incorrectly
though I guess.

>
> Second question: should we even be using the VSX registers at all in kernel 
> space?  Is this a side effect of io_uring interacting so closely with 
> userspace threads, or something else entirely?
>
> If I can get pointed somewhat in the right direction I'm ready to develop the 
> rest of the fix for this issue...just trying to avoid another several days of 
> slogging through the source to see what it's supposed to be doing in the 
> first place. :)

Kernel can use FP/VEC/VSX registers but it has to enable and disable
explicitly. Such kernel code also should not be preemptible.

enable|disable_kernel_fp|altivec|vsx().

Maybe try run the test with ppc_strict_facility_enable boot option to
start with.

If no luck with that, maybe put WARN_ON(preemptible()); checks also in
the disable_kernel_* functions.

You could also add an enable/disable counter for each, and make sure it
is balanced on context switch or kernel->userspace exit.

Thanks,
Nick

Reply via email to