* Jakub Jelinek:

> The disadvantage of the patch is that touching reg[x].loc and how[x]
> now means 2 cachelines rather than one as before, and I admit beyond
> bootstrap/regtest I haven't benchmarked it in any way.  Florian, could
> you retry whatever you measured to get at the 40% of time spent on the
> stack clearing to see how the numbers change?

A benchmark that unwinds through 100 frames containing a std::string
variable goes from (0b5b8ac5cb7fe92dd17ae8bd7de84640daa59e84):

min:     24418 ns
25%:     24740 ns
50%:     24790 ns
75%:     24840 ns
95%:     24937 ns
99%:     26174 ns
max:     42530 ns
avg:   24826.1 ns

to (0b5b8ac5cb7fe92dd17ae8bd7de84640daa59e84 with this patch):

min:     22307 ns
25%:     22640 ns
50%:     22713 ns
75%:     22787 ns
95%:     22948 ns
99%:     24839 ns
max:     52658 ns
avg:   22863.4 ns

So 227 ns per frame instead of 248 ns per frame (going by the medians),
or roughly 8% less.
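
For readers who want to reproduce this kind of measurement, a benchmark
of this shape could look roughly like the sketch below.  It is not the
benchmark that produced the numbers above; the names descend and
time_one_unwind_ns are made up for illustration, and the measured time
also includes the recursion and the exception allocation, not only the
unwinder.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds `frames` stack frames, each owning a
// std::string, so every frame has a cleanup the unwinder must run.
[[gnu::noinline]] static std::size_t descend(int frames) {
  std::string local(32, 'x');
  if (frames <= 1)
    throw std::runtime_error("unwind");
  // Use the result after the call so this is not turned into a tail call.
  return descend(frames - 1) + local.size();
}

// Hypothetical helper: times one throw/catch across `frames` frames and
// returns the elapsed time in nanoseconds.
static double time_one_unwind_ns(int frames) {
  bool caught = false;
  auto start = std::chrono::steady_clock::now();
  try {
    descend(frames);
  } catch (const std::runtime_error &) {
    caught = true;
  }
  auto stop = std::chrono::steady_clock::now();
  assert(caught);
  return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

Calling time_one_unwind_ns(100) many times and sorting the samples would
give percentiles like the ones listed above; dividing by the frame count
gives the per-frame figure.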

Moving cfa_how after how in struct frame_state_reg_info as an 8-bit
bitfield should avoid zeroing another 8 bytes.  This shaves off another
3 ns per frame in my testing (on a Core i9-10900T, so with ERMS).
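
To illustrate where the 8 bytes come from, here is a simplified sketch
of the two layouts.  These are not the real libgcc definitions: the
type names, the word types, and the register count of 18 are
assumptions for illustration, and the sizes depend on the target ABI.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr int kRegs = 18;  // assumed register count, for illustration

// Simplified stand-in for the per-register location union.
union loc_t {
  std::uintptr_t reg;
  std::intptr_t offset;
  const unsigned char *exp;
};

enum cfa_kind { CFA_UNSET, CFA_REG_OFFSET, CFA_EXP };

// cfa_how as a full enum at the end: it needs its own slot plus
// tail padding to keep the struct pointer-aligned.
struct before_layout {
  struct { loc_t loc; } reg[kRegs];
  unsigned char how[kRegs];
  before_layout *prev;
  std::intptr_t cfa_offset;
  std::uintptr_t cfa_reg;
  const unsigned char *cfa_exp;
  cfa_kind cfa_how;
};

// cfa_how as an 8-bit bitfield right after how[]: it fits into the
// padding that already exists between how[] and the next pointer.
struct after_layout {
  struct { loc_t loc; } reg[kRegs];
  unsigned char how[kRegs];
  cfa_kind cfa_how : 8;
  after_layout *prev;
  std::intptr_t cfa_offset;
  std::uintptr_t cfa_reg;
  const unsigned char *cfa_exp;
};

// Prints both sizes so the difference can be checked on a given target.
inline void print_layout_sizes() {
  std::printf("before: %zu bytes, after: %zu bytes\n",
              sizeof(before_layout), sizeof(after_layout));
}
```

On a typical LP64 target the two sizes should differ by 8 bytes, which
is the amount that no longer has to be cleared.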

The REP STOS still dominates uw_frame_state_for execution time, but this
seems to be a profiling artifact.  Replacing it with PXOR and seven
MOVUPS instructions makes the hotspot go away, but performance does not
improve.  Odd.

Thanks,
Florian
