* Jakub Jelinek:

> The disadvantage of the patch is that touching reg[x].loc and how[x]
> now means 2 cachelines rather than one as before, and I admit beyond
> bootstrap/regtest I haven't benchmarked it in any way.  Florian, could
> you retry whatever you measured to get at the 40% of time spent on the
> stack clearing to see how the numbers change?

A benchmark that unwinds through 100 frames containing a std::string
variable goes from (0b5b8ac5cb7fe92dd17ae8bd7de84640daa59e84):

  min: 24418 ns
  25%: 24740 ns
  50%: 24790 ns
  75%: 24840 ns
  95%: 24937 ns
  99%: 26174 ns
  max: 42530 ns
  avg: 24826.1 ns

to (0b5b8ac5cb7fe92dd17ae8bd7de84640daa59e84 with this patch):

  min: 22307 ns
  25%: 22640 ns
  50%: 22713 ns
  75%: 22787 ns
  95%: 22948 ns
  99%: 24839 ns
  max: 52658 ns
  avg: 22863.4 ns

So 227 ns per frame instead of 248 ns per frame, or ~9% less.

Moving cfa_how after how in struct frame_state_reg_info as an 8-bit
bitfield should avoid zeroing another 8 bytes.  This shaves off another
3 ns per frame in my testing (on a Core i9-10900T, so with ERMS).

The REP STOS still dominates uw_frame_state_for execution time, but this
seems to be a profiling artifact.  Replacing it with PXOR and seven
MOVUPS instructions makes the hotspot go away, but performance does not
improve.  Odd.

Thanks,
Florian
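
P.S. To illustrate the cfa_how suggestion above, here is a rough sketch of the layout effect.  It is not the libgcc definition: the struct names regs_before and regs_after, the NREGS value of 18, the plain integer types and the helper functions are all assumptions made for illustration.  It also assumes that the cleared region runs from how[0] to the end of the structure.

```c
#include <stddef.h>

/* Assumed register count (hypothetical; the real value is target-specific). */
#define NREGS 18

/* Simplified stand-in for the per-register location union. */
union loc_sketch {
    unsigned long reg;
    long offset;
    const unsigned char *exp;
};

/* cfa_how as a full-width member at the end of the structure. */
struct regs_before {
    union loc_sketch loc[NREGS];
    unsigned char how[NREGS];
    struct regs_before *prev;
    long cfa_offset;
    unsigned long cfa_reg;
    const unsigned char *cfa_exp;
    unsigned int cfa_how;          /* stands in for the enum */
};

/* cfa_how as an 8-bit bitfield directly after how[], where it can use
   bytes that would otherwise be alignment padding before prev. */
struct regs_after {
    union loc_sketch loc[NREGS];
    unsigned char how[NREGS];
    unsigned int cfa_how : 8;
    struct regs_after *prev;
    long cfa_offset;
    unsigned long cfa_reg;
    const unsigned char *cfa_exp;
};

/* Bytes a memset from how[0] to the end of the struct would clear. */
size_t cleared_before(void)
{
    return sizeof(struct regs_before) - offsetof(struct regs_before, how);
}

size_t cleared_after(void)
{
    return sizeof(struct regs_after) - offsetof(struct regs_after, how);
}
```

On a typical LP64 ABI the second layout should need 8 fewer bytes cleared, because the bitfield fits into the padding that follows the 18-byte how[] array.  Other ABIs may lay out bitfields differently, so the exact saving is not guaranteed.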