On Fri, Jun 15, 2018 at 8:53 PM Andy Lutomirski <l...@kernel.org> wrote:
>
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <dave.han...@linux.intel.com> wrote:
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use.  If the
> kernel is using something like AVX512 state, then kernel_fpu_end()
> will transfer an enormous amount of data no matter how clever the CPU
> is.  And I think I once measured XSAVEOPT taking a hundred cycles or
> so even when RFBM==0, so it's not exactly super fast.

Indeed the speedup is really significant, especially for the AVX512
case. Here are some numbers from my laptop and a server, taken a few
seconds ago:

AVX2 - Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Inside: 684617437
Outside: 547710093
Percent speedup: 24

AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Inside: 634415672
Outside: 286698960
Percent speedup: 121

This is from this test -- https://xn--4db.cc/F7RF2fhv/c . There are
probably various issues with that test case, and it's possible there
are other effects going on (the avx512 case looks particularly insane)
to make the difference _that_ drastic, but I think there's no doubt
that the optimization here is a meaningful one.
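
For anyone checking the arithmetic: the reported percentages are consistent with (Inside - Outside) / Outside, truncated to a whole percent. A minimal sketch of that calculation follows. The formula is an assumption inferred from the numbers, the function name percent_speedup is made up for illustration, and the linked test may well compute it differently.

```python
def percent_speedup(inside: int, outside: int) -> int:
    """Extra cost of the "Inside" run relative to "Outside", as a whole percent.

    Assumed formula (inferred from the reported numbers, not taken from the
    linked test): (inside - outside) / outside, truncated with integer division.
    """
    return (inside - outside) * 100 // outside


# Numbers copied from the two runs reported in this mail.
results = {
    "AVX2 (Xeon E3-1505M v5)": (684617437, 547710093),
    "AVX512 (Xeon Gold 5120)": (634415672, 286698960),
}

for name, (inside, outside) in results.items():
    print(f"{name}: {percent_speedup(inside, outside)}")
```

With the numbers from this mail, the sketch reproduces the reported 24 and 121.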