On Mon, Oct 16, 2017 at 10:25:28 -0700, Richard Henderson wrote:
> From: Richard Henderson <r...@twiddle.net>
>
> This avoids having to allocate external memory for each temporary.
>
> Signed-off-by: Richard Henderson <r...@twiddle.net>
> ---

Unfortunately, this patch undoes the small perf gains we made so far
in this series. We end up running more instructions, I guess due to the
loops in setting the per-temp states (whereas earlier we just had a
memset).

Same aarch64 boot benchmark, 10 runs:

Before:
       7125.400889      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.15% )
            21,654      context-switches          #    0.003 M/sec                    ( +-  0.12% )
                 1      cpu-migrations            #    0.000 K/sec
             8,034      page-faults               #    0.001 M/sec                    ( +-  1.22% )
    30,050,759,263      cycles                    #    4.217 GHz                      ( +-  0.15% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    53,764,201,351      instructions              #    1.79  insns per cycle          ( +-  0.09% )
     9,677,042,191      branches                  # 1358.105 M/sec                    ( +-  0.09% )
       170,903,903      branch-misses             #    1.77% of all branches          ( +-  0.16% )

       7.136617151 seconds time elapsed                                          ( +-  0.17% )

After:
       7326.945822      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.24% )
            21,997      context-switches          #    0.003 M/sec                    ( +-  0.16% )
                 1      cpu-migrations            #    0.000 K/sec
             8,400      page-faults               #    0.001 M/sec                    ( +-  4.63% )
    30,900,509,346      cycles                    #    4.217 GHz                      ( +-  0.23% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    55,736,672,258      instructions              #    1.80  insns per cycle          ( +-  0.16% )
     9,989,723,969      branches                  # 1363.423 M/sec                    ( +-  0.16% )
       179,662,782      branch-misses             #    1.80% of all branches          ( +-  0.16% )

       7.335805286 seconds time elapsed                                          ( +-  0.24% )

I tried merging .state into the bitfield, but that didn't help (the
dcache isn't the issue here).

Anyway we use .state_ptr later in this series, so:

Reviewed-by: Emilio G. Cota <c...@braap.org>

                E.
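
For illustration only, here is a minimal sketch of the difference guessed at
above: resetting an external per-temp state array with a single memset versus
looping over temps that each embed their own state. DemoTemp, reset_external()
and reset_embedded() are hypothetical names made up for this sketch; they are
not the structures or functions from the patch, and the field layout is an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a temp descriptor; NOT the real TCGTemp. */
typedef struct DemoTemp {
    int kind;            /* unrelated per-temp data that must survive a reset */
    uintptr_t state;     /* per-temp state embedded in the temp itself */
} DemoTemp;

/*
 * "Before" shape: state lives in a separate, contiguous array, so one
 * memset clears all of it.
 */
static void reset_external(uint8_t *states, size_t nb_temps)
{
    assert(states != NULL);
    memset(states, 0, nb_temps * sizeof(states[0]));
}

/*
 * "After" shape: state is interleaved with other fields, so each temp is
 * visited individually. This is a strided store loop rather than a bulk
 * clear, which is one plausible source of extra instructions and branches.
 */
static void reset_embedded(DemoTemp *temps, size_t nb_temps)
{
    assert(temps != NULL);
    for (size_t i = 0; i < nb_temps; i++) {
        temps[i].state = 0;
    }
}
```

Both functions leave every state at zero; the sketch only shows why the
second form cannot be collapsed into a single memset without also wiping
the neighbouring fields.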