> Date: Mon, 22 Jun 2020 18:45:47 +0000 (UTC)
> From: Eduardo Horvath <e...@netbsd.org>
>
> I think this is sort of a half-measure since it restricts
> coprocessor usage to a few threads.  If you want to say, implement
> the kernel memcopy using vector registers (the way sparc64 does)
> this doesn't help and may end up getting in the way.

Why do you think this restricts it to a few threads or gets in the way
of anything?  As I wrote in my original message:

   That way, for example, you can use (say) an AES encryption routine
   aes_enc as a subroutine anywhere in the kernel, and an MD
   definition of aes_enc can internally use AES-NI with the
   appropriate MD fpu_kern_enter -- but it's a little cheaper to use
   aes_enc in an FPU-enabled kthread.  This gave a modest measurable
   boost to cgd(4) throughput in my preliminary experiments.

Note that the subroutine (here aes_enc, but it could in principle be
memcpy too) works `anywhere in the kernel', not just restricted to a
few threads.

The definition of aes_enc with AES-NI CPU instructions on x86 already
works (see
https://mail-index.netbsd.org/tech-kern/2020/06/18/msg026505.html for
details); just putting kthread_fpu_enter/exit around cgd_process in
cgd.c improved throughput on a RAM-backed disk by about 20%
(presumably mostly because it avoids zeroing the fpu registers on
every aes_* call in that thread).

> I'd do something simpler such as adding a MI routine to allocate or
> activate a temporary or permanent register save area that can be used by
> kernel threads.
>
> Then, if you want, in the coprocessor trap handler, if you want, if you
> are in kernel state you can check whether a kernel save area has been
> allocated and panic if not.

This sounds like a plausible alternative to disabling kpreemption in
some cases, but it is also orthogonal to my proposal -- in an
FPU-enabled kthread there is simply no need to allocate an extra save
area at all because it's already allocated in the lwp pcb, so if a
subroutine does use the FPU then it's cheaper to call that subroutine
in an FPU-enabled kthread than otherwise.

You say it would be simpler -- can you elaborate on how it would
simplify the implementations that already work on x86 and aarch64 by
just adding and testing a new flag in a couple places, and enabling or
disabling the CPU's FPU-enable bit?

https://anonhg.netbsd.org/src-all/rev/e83ef87e4f53
https://anonhg.netbsd.org/src-all/rev/7ec4225df101
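
To make the cost argument concrete, here is a rough userland model of
the idea, not the kernel code from the changesets above.  Everything
prefixed model_ is a hypothetical stand-in invented for illustration:
the signatures, the single per-lwp flag, and the "zeroing" counter are
assumptions, and the XOR is only a placeholder for AES.  It sketches
why a subroutine that brackets its FPU use works from any thread, and
why it should do less work inside an FPU-enabled kthread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userland model only; NOT the NetBSD implementation. */
struct model_lwp {
	bool		fpu_enabled;	/* assumed per-lwp "kthread FPU" flag */
	unsigned	fpu_zeroings;	/* times the per-call cleanup ran */
};

static struct model_lwp model_curlwp;

/* Assumed shape: return the previous state so that calls can nest. */
static int
model_kthread_fpu_enter(void)
{
	int s = model_curlwp.fpu_enabled;

	model_curlwp.fpu_enabled = true;
	return s;
}

static void
model_kthread_fpu_exit(int s)
{
	assert(model_curlwp.fpu_enabled);
	model_curlwp.fpu_enabled = (s != 0);
}

/* Stand-ins for the MD bracket around each FPU-using subroutine. */
static void
model_fpu_kern_enter(void)
{
	/* In the model there is nothing to count on the way in. */
}

static void
model_fpu_kern_leave(void)
{
	if (model_curlwp.fpu_enabled)
		return;		/* FPU-enabled kthread: skip the cleanup */
	model_curlwp.fpu_zeroings++;	/* models zeroing the fpu registers */
}

/* Callable from any thread; the XOR is a placeholder, NOT real AES. */
static void
model_aes_enc(const uint8_t key[16], uint8_t block[16])
{
	model_fpu_kern_enter();
	for (size_t i = 0; i < 16; i++)
		block[i] ^= key[i];
	model_fpu_kern_leave();
}

/* Models a cgd_process-like loop; returns how many cleanups were paid. */
static unsigned
model_process(size_t nblocks, bool use_kthread_fpu)
{
	static const uint8_t key[16] = { 0x42 };
	uint8_t block[16] = { 0 };
	int s = 0;

	model_curlwp.fpu_zeroings = 0;
	if (use_kthread_fpu)
		s = model_kthread_fpu_enter();
	for (size_t i = 0; i < nblocks; i++)
		model_aes_enc(key, block);
	if (use_kthread_fpu)
		model_kthread_fpu_exit(s);
	return model_curlwp.fpu_zeroings;
}
```

In this model, model_process(8, false) pays the cleanup once per
model_aes_enc call, while model_process(8, true) pays it zero times;
model_aes_enc itself is identical in both cases, which is the point.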