On Thu, Jan 7, 2021 at 5:32 AM kernel test robot <oliver.s...@intel.com> wrote: > > FYI, we noticed a -5.8% regression of will-it-scale.per_thread_ops due to > commit:
Ok, that's noticeable. And: > commit: d55564cfc222326e944893eff0c4118353e349ec ("x86: Make __put_user() > generate an out-of-line call") Yeah, that wasn't supposed to cause any performance regressions. No core code should use __put_user() so much. But: > | testcase: change | will-it-scale: will-it-scale.per_process_ops -7.3% > regression | > | test machine | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz > with 192G memory | > | test parameters | cpufreq_governor=performance > | > | | mode=process > | > | | nr_task=100% > | > | | test=poll2 > | > | | ucode=0x16 > | Ok, it's poll(), and it's definitely the __put_user() there: > 0.00 +1.8 1.77 ą 3% > perf-profile.children.cycles-pp.__put_user_nocheck_2 > 0.00 +1.6 1.64 ą 3% > perf-profile.self.cycles-pp.__put_user_nocheck_2 And in fact, it's that final "write back the 16-bit revents field" at the end. Which must have sucked before too, because it used to do a "stac/clac" for every word - but now it does it out of line. The fix is to convert that loop to use "unsafe_put_user()" with the necessary accoutrements around it, and that should speed things up quite nicely. The (double) loop itself is actually just 14 instructions, it's ridiculous how bad the code used to be, and how much better it is with the nice unsafe_put_user(). The whole double loop ends up being just lea 0x68(%rsp),%rsi mov %rcx,%rax 1: mov 0x8(%rsi),%ecx lea 0xc(%rsi),%rdx test %ecx,%ecx je 3f lea (%rax,%rcx,8),%rdi 2: movzwl 0x6(%rdx),%ecx mov %cx,0x6(%rax) add $0x8,%rax add $0x8,%rdx cmp %rdi,%rax jne 2b 3: mov (%rsi),%rsi test %rsi,%rsi jne 1b with the attached patch. Before, it would do the whole CLAC/STAC dance inside that loop for every entry (and with that commit d55564cfc22 it would be a function call, of course). Can you verify that this fixes the regression (and in fact I'd expect it to improve that test-case)? NOTE! The patch is entirely untested. I verified that the code generation now looks sane, and it all looks ObviouslyCorrect(tm) to me, but mistakes happen and maybe I missed some detail.. Linus
patch
Description: Binary data