----- Original Message -----
> From: "Timothy Pearson" <tpear...@raptorengineering.com>
> To: "npiggin" <npig...@gmail.com>
> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:20:29 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> ----- Original Message -----
>> From: "npiggin" <npig...@gmail.com>
>> To: "Timothy Pearson" <tpear...@raptorengineering.com>, "Michael Ellerman"
>> <m...@ellerman.id.au>
>> Cc: "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>> Sent: Friday, November 17, 2023 2:01:12 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>
>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>>
>>>
>>> ----- Original Message -----
>>> > From: "Michael Ellerman" <m...@ellerman.id.au>
>>> > To: "Timothy Pearson" <tpear...@raptorengineering.com>, "linuxppc-dev"
>>> > <linuxppc-dev@lists.ozlabs.org>
>>> > Cc: "Jens Axboe" <ax...@kernel.dk>
>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>>
>>> > Hi Timothy,
>>> >
>>> > Thanks for debugging this, but I'm unclear why this is helping because
>>> > we should already have a full barrier (hwsync) on both the sending and
>>> > receiving side.
>>> >
>>> > More below.
>>>
>>> I've spent another few days poking at this, and think I might finally have
>>> something more solid in terms of what exactly is happening, but would like
>>> some feedback on the concept / how best to fix the potential problem.
>>>
>>> As background, there are several worker threads both in userspace and in
>>> kernel mode. Crucially, the main MariaDB data processing thread (the one
>>> that handles tasks like flushing dirty pages to disk) always runs on the
>>> same core as the io_uring kernel thread that picks up I/O worker creation
>>> requests and handles them via create_worker_cb().
>>>
>>> Changes in the ~5.12 era switched away from a delayed worker setup. io_uring
>>> currently sets up the new process with create_io_thread(), and immediately
>>> uses an IPI to forcibly schedule the new process. Because of the way the two
>>> threads interact, the new process ends up grabbing the CPU from the running
>>> MariaDB user thread; I've never seen it schedule on a different core. If the
>>> timing is right in this process, things get trampled on in userspace and the
>>> database server either crashes or throws a corruption fault.
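>>>
>>> For reference, the sequence I'm describing looks roughly like the following
>>> (paraphrased from the io_uring worker creation path; exact function names and
>>> surrounding bookkeeping vary between kernel versions, so treat this as a
>>> sketch rather than the literal source):
>>>
>>>     /* Sketch only -- not the literal io_uring code. */
>>>     tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
>>>     if (!IS_ERR(tsk)) {
>>>             /* ...worker bookkeeping... */
>>>             wake_up_new_task(tsk);  /* can IPI and immediately preempt the
>>>                                      * CPU the submitting task runs on */
>>>     }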
>>>
>>> Through extensive debugging, I've narrowed this down to invalid state in the
>>> VSX registers on return to the MariaDB user thread from the new kernel
>>> thread. For some reason, it seems we don't restore FP state on return from
>>> the PF_IO_WORKER thread, and something in the kernel was busy writing new
>>> data to them.
>>>
>>> A direct example I was able to observe is as follows:
>>>
>>> xxspltd vs0,vs0,0    <-- vs0 now zeroed out
>>> xori r9,r9,1         <-- Presumably we switch to the new kernel thread here
>>>                          due to the IPI
>>> slwi r9,r9,7         <-- On userspace thread resume, vs0 now contains the
>>>                          value 0x820040000000000082004000
>>> xxswapd vs8,vs0      <-- vs8 now has the wrong value
>>> stxvd2x vs8,r3,r12   <-- userspace is now getting stepped on
>>> stw r9,116(r3)
>>> stxvd2x vs8,r3,r0
>>> ...
>>> CRASH
>>
>> Nice find, that looks pretty conclusive.
>>
>>> This is a very difficult race to hit, but MariaDB naturally does repetitive
>>> operations with VSX registers so it does eventually fail. I ended up with a
>>> tight loop around glibc operations that use VSX to trigger the failure
>>> reliably enough to even understand what was going on.
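>>>
>>> The reproducer is essentially of this shape (simplified and hypothetical; the
>>> glibc call and buffer size are just stand-ins for whatever happens to use VSX
>>> on a given system):
>>>
>>>     #include <assert.h>
>>>     #include <string.h>
>>>
>>>     int main(void)
>>>     {
>>>             char src[4096], dst[4096];
>>>
>>>             memset(src, 0xAA, sizeof(src));
>>>             for (;;) {
>>>                     /* glibc memcpy typically uses VSX loads/stores for
>>>                      * copies this size on POWER; if the thread's vector
>>>                      * state is clobbered mid-copy, dst ends up corrupted. */
>>>                     memcpy(dst, src, sizeof(dst));
>>>                     assert(memcmp(dst, src, sizeof(dst)) == 0);
>>>             }
>>>     }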
>>>
>>> As I am not as familiar with this part of the Linux kernel as with most
>>> other areas, what is the expected save/restore path for the FP/VSX registers
>>> around an IPI and associated forced thread switch? If restore_math() is in
>>> the return path, note that MSR_FP is set in regs->msr.
>>
>> Context switching these FP/vec registers should be pretty robust in
>> general because it's not just io-uring that uses them. io-uring could
>> be using some uncommon kernel code that uses the registers incorrectly
>> though I guess.
>>
>>>
>>> Second question: should we even be using the VSX registers at all in kernel
>>> space? Is this a side effect of io_uring interacting so closely with
>>> userspace threads, or something else entirely?
>>>
>>> If I can get pointed somewhat in the right direction I'm ready to develop
>>> the rest of the fix for this issue...just trying to avoid another several
>>> days of slogging through the source to see what it's supposed to be doing
>>> in the first place. :)
>>
>> The kernel can use FP/VEC/VSX registers, but it has to enable and disable
>> them explicitly. Such kernel code also should not be preemptible.
>>
>> enable|disable_kernel_fp|altivec|vsx().
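>>
>> The usual pattern in kernel code that wants the vector unit looks something
>> like this (illustrative only; the vmx crypto driver and the raid6 code have
>> real examples):
>>
>>      preempt_disable();
>>      enable_kernel_vsx();
>>      /* ... code that clobbers FP/VSX registers ... */
>>      disable_kernel_vsx();
>>      preempt_enable();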
>>
>> Maybe try running the test with the ppc_strict_facility_enable boot option
>> to start with.
>>
>> If no luck with that, maybe also put WARN_ON(preemptible()); checks in the
>> disable_kernel_* functions.
>>
>> You could also add an enable/disable counter for each, and make sure it
>> is balanced on context switch or kernel->userspace exit.
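>>
>> e.g. something like (hypothetical debug instrumentation, not existing code):
>>
>>      DEFINE_PER_CPU(int, kernel_fp_depth);
>>
>>      /* this_cpu_inc() in enable_kernel_fp(), this_cpu_dec() in
>>       * disable_kernel_fp(), then on context switch and on return to
>>       * userspace check that it balanced out: */
>>      WARN_ON_ONCE(this_cpu_read(kernel_fp_depth) != 0);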
>>
>> Thanks,
>> Nick
>
> Will do, thanks for the hints!
>
> I had a debug idea just as I sent the earlier message, and was able to confirm
> the kernel is purposefully restoring the bad data in at least one spot in the
> thread's history, though curiously *not* right before everything goes off the
> rails. I also dumped the entire kernel binary and confirmed it isn't touching
> the vs* registers, so overall this is leaning more toward a bad value being
> restored than kernel code inadvertently making use of the vector registers.
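>
> (The sort of check I mean is a printk of the saved FP/VSX state in the restore
> path, before it is copied back into the registers -- roughly the following,
> purely as throwaway debug code:
>
>       pr_info("restoring fpr0 = %016llx %016llx for %s/%d\n",
>               (unsigned long long)current->thread.fp_state.fpr[0][0],
>               (unsigned long long)current->thread.fp_state.fpr[0][1],
>               current->comm, current->pid);
>
> the idea being to see whether the bad 0x82004000... pattern is already sitting
> in the thread struct before the restore happens.)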
>
> Am I missing anything other than do_restore_fp() that could touch the vs*
> registers around an interrupt-driven task switch?

One other piece of this puzzle -- I'm running this via QEMU in KVM mode. I
noticed there is some code that touches FP state in the KVM tree; is there any
way we could be having a problem lower down the stack (i.e. at the hypervisor
level) that would manifest this way in the guest?
I can try to test on bare metal tomorrow to rule that out one way or the other.