On Thu, Mar 22, 2018 at 01:29:23 +0000, Alex Bennée wrote:
> Emilio G. Cota <c...@braap.org> writes:
>
> > Performance results for fp-bench run under aarch64-linux-user
> > on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz host:
> >
> > - before:
> > sqrt-single: 13.23 MFlops
> > sqrt-double: 13.24 MFlops
> >
> > - after:
> > sqrt-single: 15.02 MFlops
> > sqrt-double: 15.07 MFlops
> >
> > Note that sqrt in soft-ft is relatively fast, which means
> > that fp-bench is not very sensitive to changes to sqrt's
> > emulation speed.
>
> Weird, I thought we had slowed it down quite a bit in the re-factor as
> we eschewed the estimate step for an easier to read but slower iterative
> process. That's why I chose sqrt for my hostfp hack experiment.

Yes, my first statement ("soft-ft is relatively fast") is wrong.
Sorry about that, I thought I had deleted it but it slipped through.

What I should have said (but decided against to keep the commit log
short) is that fp-bench doesn't do a good job of being sensitive to
the performance of the sqrt instruction, so even if we got it to take
0 time we'd still get only a small speedup.

I just realised that this happens because ~50% of the inputs are
negative, which will go through some very slow paths. This ends up
showing in perf like this:

# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ...........................
#
    61.74%  fp-bench  fp-bench           [.] main
    22.58%  fp-bench  libm-2.23.so       [.] __kernel_standard
     6.22%  fp-bench  libm-2.23.so       [.] __kernel_standard_f
     5.21%  fp-bench  libm-2.23.so       [.] __sqrtf
     2.17%  fp-bench  fp-bench           [.] _init
     1.91%  fp-bench  [kernel.kallsyms]  [k] __call_rcu.constprop.70
     0.18%  fp-bench  [kernel.kallsyms]  [k] cpumask_any_but
     0.01%  perf      [kernel.kallsyms]  [k] native_iret
     0.00%  perf      [kernel.kallsyms]  [k] native_write_msr_safe

__sqrtf (which does 'sqrtss %xmm0,%xmm0') only takes 5% of the time!

I just fixed fp-bench to discard negative inputs. This looks much
better:
(Note that this is fp-test-x86_64 instead of -aarch64, which explains
why the "before" throughput is different than the one reported above)

[...]
+fma: (patch 11, i.e. sqrt still in soft-fp)
sqrt-single: 27.11 MFlops
sqrt-double: 27.17 MFlops

+sqrt: (12)
sqrt-single: 66.67 MFlops
sqrt-double: 66.79 MFlops

+cmp: (13)
sqrt-single: 126.46 MFlops
sqrt-double: 126.06 MFlops

+f32f64: (patch 14)
sqrt-single: 122.75 MFlops
sqrt-double: 126.57 MFlops

We get a >2x speedup, which is consistent with the fact that perf now
shows that sqrt takes ~60% of execution time. Compare does matter
here as well because libm is checking sqrt's result against NaN.

I'll include this fix to fp-bench in v2.

Thanks,

		E.
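
For illustration only, here is a minimal sketch of the kind of input
filtering described above: random bit patterns are reinterpreted as
floats, and anything negative (or NaN) is redrawn so that sqrt never
takes the error path. This is not the actual fp-bench change. The PRNG
and all names (xorshift64, next_sqrt_input, fill_sqrt_inputs) are
hypothetical and assumed purely for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical PRNG for the sketch; fp-bench may use something else. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Reinterpret 32 random bits as a single-precision float. */
static float bits_to_float(uint32_t bits)
{
    float f;

    memcpy(&f, &bits, sizeof(f));
    return f;
}

/*
 * Draw random floats until one is a valid sqrt input.
 * "f >= 0.0f" is false for negative values and for NaNs, so both are
 * redrawn; zeroes and +inf are kept.
 */
static float next_sqrt_input(uint64_t *state)
{
    for (;;) {
        float f = bits_to_float((uint32_t)xorshift64(state));

        if (f >= 0.0f) {
            return f;
        }
    }
}

/* Fill buf with n filtered inputs; xorshift needs a non-zero seed. */
static void fill_sqrt_inputs(float *buf, size_t n, uint64_t seed)
{
    uint64_t state = seed;

    assert(seed != 0);
    for (size_t i = 0; i < n; i++) {
        buf[i] = next_sqrt_input(&state);
    }
}
```

Since roughly half of all 32-bit patterns have the sign bit clear, the
redraw loop is expected to finish after about two draws on average, so
the filtering itself should add little overhead to input generation.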