On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: > > > ----- Original Message ----- > > From: "Michael Ellerman" <m...@ellerman.id.au> > > To: "Timothy Pearson" <tpear...@raptorengineering.com>, "linuxppc-dev" > > <linuxppc-dev@lists.ozlabs.org> > > Cc: "Jens Axboe" <ax...@kernel.dk> > > Sent: Tuesday, November 14, 2023 6:14:37 AM > > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > > > Hi Timothy, > > > > Thanks for debugging this, but I'm unclear why this is helping because > > we should already have a full barrier (hwsync) on both the sending and > > receiving side. > > > > More below. > > I've spent another few days poking at this, and think I might finally have > something more solid in terms of what exactly is happening, but would like > some feedback on the concept / how best to fix the potential problem. > > As background, there are several worker threads both in userspace and in > kernel mode. Crucially, the main MariaDB data processing thread (the one > that handles tasks like flushing dirty pages to disk) always runs on the same > core as the io_uring kernel thread that picks up I/O worker creation requests > and handles them via create_worker_cb(). > > Changes in the ~5.12 era switched away from a delayed worker setup. io_uring > currently sets up the new process with create_io_thread(), and immediately > uses an IPI to forcibly schedule the new process. Because of the way the two > threads interact, the new process ends up grabbing the CPU from the running > MariaDB user thread; I've never seen it schedule on a different core. If the > timing is right in this process, things get trampled on in userspace and the > database server either crashes or throws a corruption fault. > > Through extensive debugging, I've narrowed this down to invalid state in the > VSX registers on return to the MariaDB user thread from the new kernel > thread. For some reason, it seems we don't restore FP state on return from > the PF_IO_WORKER thread, and something in the kernel was busy writing new > data to them. > > A direct example I was able to observe is as follows: > > xxspltd vs0,vs0,0 <-- vs0 now zeroed out > xori r9,r9,1 <-- Presumably we switch to the new kernel thread here > due to the IPI > slwi r9,r9,7 <-- On userspace thread resume, vs0 now contains the > value 0x820040000000000082004000 > xxswapd vs8,vs0 <-- vs8 now has the wrong value > stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on > stw r9,116(r3) > stxvd2x vs8,r3,r0 > ... > CRASH
Nice find, that looks pretty conclusive. > This is a very difficult race to hit, but MariaDB naturally does repetitive > operations with VSX registers so it does eventually fail. I ended up with a > tight loop around glibc operations that use VSX to trigger the failure > reliably enough to even understand what was going on. > > As I am not as familiar with this part of the Linux kernel as with most other > areas, what is the expected save/restore path for the FP/VSX registers around > an IPI and associated forced thread switch? If restore_math() is in the > return path, note that MSR_FP is set in regs->msr. Context switching these FP/vec registers should be pretty robust in general because it's not just io-uring that uses them. io-uring could be using some uncommon kernel code that uses the registers incorrectly though I guess. > > Second question: should we even be using the VSX registers at all in kernel > space? Is this a side effect of io_uring interacting so closely with > userspace threads, or something else entirely? > > If I can get pointed somewhat in the right direction I'm ready to develop the > rest of the fix for this issue...just trying to avoid another several days of > slogging through the source to see what it's supposed to be doing in the > first place. :) Kernel can use FP/VEC/VSX registers but it has to enable and disable explicitly. Such kernel code also should not be preemptible. enable|disable_kernel_fp|altivec|vsx(). Maybe try run the test with ppc_strict_facility_enable boot option to start with. If no luck with that, maybe put WARN_ON(preemptible()); checks also in the disable_kernel_* functions. You could also add an enable/disable counter for each, and make sure it is balanced on context switch or kernel->userspace exit. Thanks, Nick