On Tue, 20 Mar 2018, Ingo Molnar wrote:
> * Thomas Gleixner <t...@linutronix.de> wrote:
>
> > > Useful also for code that needs AVX-like registers to do things like CRCs.
> >
> > x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().

Correct.

> So assuming the target driver will only load on modern FPUs I *think* it should
> actually be possible to do something like (pseudocode):
>
>	vmovdqa %ymm0, 40(%rsp)
>	vmovdqa %ymm1, 80(%rsp)
>
>	...
>	# use ymm0 and ymm1
>	...
>
>	vmovdqa 80(%rsp), %ymm1
>	vmovdqa 40(%rsp), %ymm0
>
> ... without using the heavy XSAVE/XRSTOR instructions.
>
> Note that preemption probably still needs to be disabled and possibly there are
> other details as well, but there should be no 'heavy' FPU operations.

Emphasis on should :)

> I think this should still preserve all user-space FPU state and shouldn't muck up
> any 'weird' user-space FPU state (such as pending exceptions, legacy x87 running
> code, NaN registers or weird FPU control word settings) we might have interrupted
> either.
>
> But I could be wrong, it should be checked whether this sequence is safe.
> Worst-case we might have to save/restore the FPU control and tag words - but those
> operations should still be much faster than a full XSAVE/XRSTOR pair.

Fair enough.

> So I do think we could do more in this area to improve driver performance, if the
> code is correct and if there's actual benchmarks that are showing real benefits.

If it's about hotpath performance I'm all for it, but the use case here is
a debug facility...

And if we go down that road then we want an AVX-based memcpy()
implementation which is runtime conditional on the feature bit(s) and
length dependent. Just slapping a readqq() at it and using it in a loop does
not make any sense.

Thanks,

	tglx
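
For illustration, a minimal user-space sketch of a copy routine that is
"runtime conditional on the feature bit(s) and length dependent". This is
not kernel code and not a proposed patch: copy_dispatch(), copy_avx() and
the AVX_COPY_MIN_LEN threshold are hypothetical names and values picked for
the example. In the kernel the feature test would presumably be something
like boot_cpu_has(X86_FEATURE_AVX), and the vector part would have to sit
between kernel_fpu_begin() and kernel_fpu_end(), or inside whatever lighter
scheme turns out to be safe.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off: below this the setup cost is assumed not to pay off. */
#define AVX_COPY_MIN_LEN 256

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Copy in 32-byte chunks through a ymm register, tail via plain memcpy(). */
__attribute__((target("avx")))
static void copy_avx(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len >= 32) {
		__m256i v = _mm256_loadu_si256((const __m256i *)s);

		_mm256_storeu_si256((__m256i *)d, v);
		s += 32;
		d += 32;
		len -= 32;
	}
	memcpy(d, s, len);
}
#endif

/* Use the AVX path only if the CPU has the feature and the copy is large. */
void *copy_dispatch(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
	if (len >= AVX_COPY_MIN_LEN && __builtin_cpu_supports("avx")) {
		copy_avx(dst, src, len);
		return dst;
	}
#endif
	memcpy(dst, src, len);
	return dst;
}
```

The threshold would have to come from benchmarks on real hardware; the
value above is only a placeholder.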