----- Original Message -----
> From: "Timothy Pearson" <tpear...@raptorengineering.com>
> To: "Timothy Pearson" <tpear...@raptorengineering.com>
> Cc: "npiggin" <npig...@gmail.com>, "linuxppc-dev" 
> <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:26:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> ----- Original Message -----
>> From: "Timothy Pearson" <tpear...@raptorengineering.com>
>> To: "npiggin" <npig...@gmail.com>
>> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>> Sent: Friday, November 17, 2023 2:20:29 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> 
>> ----- Original Message -----
>>> From: "npiggin" <npig...@gmail.com>
>>> To: "Timothy Pearson" <tpear...@raptorengineering.com>, "Michael Ellerman"
>>> <m...@ellerman.id.au>
>>> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>>> Sent: Friday, November 17, 2023 2:01:12 AM
>>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>> 
>>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>> > From: "Michael Ellerman" <m...@ellerman.id.au>
>>>> > To: "Timothy Pearson" <tpear...@raptorengineering.com>, "linuxppc-dev"
>>>> > <linuxppc-dev@lists.ozlabs.org>
>>>> > Cc: "Jens Axboe" <ax...@kernel.dk>
>>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>>>
>>>> > Hi Timothy,
>>>> > 
>>>> > Thanks for debugging this, but I'm unclear why this is helping because
>>>> > we should already have a full barrier (hwsync) on both the sending and
>>>> > receiving side.
>>>> > 
>>>> > More below.
>>>>
>>>> I've spent another few days poking at this, and think I might finally
>>>> have something more solid in terms of what exactly is happening, but
>>>> would like some feedback on the concept / how best to fix the potential
>>>> problem.
>>>>
>>>> As background, there are several worker threads both in userspace and
>>>> in kernel mode.  Crucially, the main MariaDB data processing thread
>>>> (the one that handles tasks like flushing dirty pages to disk) always
>>>> runs on the same core as the io_uring kernel thread that picks up I/O
>>>> worker creation requests and handles them via create_worker_cb().
>>>>
>>>> Changes in the ~5.12 era switched away from a delayed worker setup.
>>>> io_uring currently sets up the new process with create_io_thread(),
>>>> and immediately uses an IPI to forcibly schedule the new process.
>>>> Because of the way the two threads interact, the new process ends up
>>>> grabbing the CPU from the running MariaDB user thread; I've never seen
>>>> it schedule on a different core.  If the timing is right during this
>>>> handoff, userspace state gets trampled and the database server either
>>>> crashes or throws a corruption fault.
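>>>>
>>>> For reference, the creation path in io_uring (io-wq.c) boils down to
>>>> roughly the following (a paraphrased sketch, not the literal source;
>>>> exact arguments elided):
>>>>
>>>> 	tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
>>>> 	if (!IS_ERR(tsk))
>>>> 		wake_up_new_task(tsk);	/* scheduler IPIs the target CPU */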
>>>>
>>>> Through extensive debugging, I've narrowed this down to invalid state
>>>> in the VSX registers on return to the MariaDB user thread from the new
>>>> kernel thread.  For some reason, it seems we don't restore FP state on
>>>> return from the PF_IO_WORKER thread, and something in the kernel was
>>>> busy writing new data to them.
>>>>
>>>> A direct example I was able to observe is as follows:
>>>>
>>>> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
>>>> xori    r9,r9,1        <-- Presumably we switch to the new kernel
>>>>                            thread here due to the IPI
>>>> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains
>>>>                            the value 0x820040000000000082004000
>>>> xxswapd vs8,vs0        <-- vs8 now has the wrong value
>>>> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
>>>> stw     r9,116(r3)
>>>> stxvd2x vs8,r3,r0
>>>> ...
>>>> CRASH
>>> 
>>> Nice find, that looks pretty conclusive.
>>> 
>>>> This is a very difficult race to hit, but MariaDB naturally does
>>>> repetitive operations with VSX registers, so it does eventually fail.
>>>> I ended up with a tight loop around glibc operations that use VSX to
>>>> trigger the failure reliably enough to even understand what was going
>>>> on.
>>>>
>>>> As I am not as familiar with this part of the Linux kernel as with
>>>> most other areas, what is the expected save/restore path for the
>>>> FP/VSX registers around an IPI and the associated forced thread
>>>> switch?  If restore_math() is in the return path, note that MSR_FP is
>>>> set in regs->msr.
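>>>>
>>>> (For concreteness: as I understand it, restore_math() only reloads
>>>> facilities that are *not* already enabled in regs->msr, so a check
>>>> like this hypothetical debug snippet in the return path
>>>>
>>>> 	if (regs->msr & MSR_FP)
>>>> 		pr_info("MSR_FP already set on return to user\n");
>>>>
>>>> fires in the failing case, meaning the FP/VSX state would not be
>>>> reloaded there.)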
>>> 
>>> Context switching these FP/vec registers should be pretty robust in
>>> general because it's not just io-uring that uses them. io-uring could
>>> be using some uncommon kernel code that uses the registers incorrectly
>>> though I guess.
>>> 
>>>>
>>>> Second question: should we even be using the VSX registers at all in
>>>> kernel space?  Is this a side effect of io_uring interacting so
>>>> closely with userspace threads, or something else entirely?
>>>>
>>>> If I can get pointed somewhat in the right direction, I'm ready to
>>>> develop the rest of the fix for this issue...just trying to avoid
>>>> another several days of slogging through the source to see what it's
>>>> supposed to be doing in the first place. :)
>>> 
>>> The kernel can use FP/VEC/VSX registers, but it has to enable and
>>> disable them explicitly.  Such kernel code also should not be
>>> preemptible.
>>> 
>>> enable|disable_kernel_fp|altivec|vsx().
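>>>
>>> A typical use looks something like this (a minimal sketch; the xor_vmx
>>> glue in arch/powerpc/lib is a real user of the altivec variant):
>>>
>>> 	preempt_disable();
>>> 	enable_kernel_fp();
>>> 	/* FP/VSX-using code; must not sleep or schedule here */
>>> 	disable_kernel_fp();
>>> 	preempt_enable();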
>>> 
>>> Maybe try running the test with the ppc_strict_facility_enable boot
>>> option to start with.
>>> 
>>> If no luck with that, maybe put WARN_ON(preemptible()); checks also in
>>> the disable_kernel_* functions.
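>>>
>>> i.e. something like this, against my reading of
>>> arch/powerpc/include/asm/switch_to.h:
>>>
>>> 	static inline void disable_kernel_fp(void)
>>> 	{
>>> 		WARN_ON(preemptible());	/* added debug check */
>>> 		msr_check_and_clear(MSR_FP);
>>> 	}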
>>> 
>>> You could also add an enable/disable counter for each, and make sure it
>>> is balanced on context switch or kernel->userspace exit.
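>>>
>>> E.g. a per-cpu depth counter (hypothetical instrumentation, the names
>>> are made up):
>>>
>>> 	DEFINE_PER_CPU(int, kernel_fp_depth);
>>>
>>> 	/* in enable_kernel_fp() */
>>> 	this_cpu_inc(kernel_fp_depth);
>>> 	/* in disable_kernel_fp() */
>>> 	this_cpu_dec(kernel_fp_depth);
>>> 	/* at context switch and on exit to userspace */
>>> 	WARN_ON(this_cpu_read(kernel_fp_depth) != 0);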
>>> 
>>> Thanks,
>>> Nick
>> 
>> Will do, thanks for the hints!
>> 
>> I had a debug idea just as I sent the earlier message, and was able to
>> confirm the kernel is purposefully restoring the bad data in at least
>> one spot in the thread's history, though curiously *not* right before
>> everything goes off the rails.  I also dumped the entire kernel binary
>> and confirmed it isn't touching the vs* registers, so overall this is
>> leaning more toward a bad value being restored than kernel code
>> inadvertently making use of the vector registers.
>> 
>> Am I missing anything other than do_restore_fp() that could touch the
>> vs* registers around an interrupt-driven task switch?
> 
> One other piece of this puzzle -- I'm running this via qemu in kvm mode.
> I noticed there is some code that touches FP state in the kvm tree; is
> there any way we could be having a problem lower down the stack (i.e. at
> the hypervisor level) that would manifest this way in the guest?
> 
> I can try to test on bare metal tomorrow to rule that out one way or the
> other.

Just tried on bare metal; the same corruption issues remain, so I'm going
to assume qemu / kvm is not the issue here.
