On 08/01/18 10:42, Paul Turner wrote:
> A sequence for efficiently refilling the RSB is:
> mov $8, %rax;
> .align 16;
> 3: call 4f;
> 3p: pause; call 3p;
> .align 16;
> 4: call 5f;
> 4p: pause; call 4p;
> .align 16;
> 5: dec %rax;
> jnz 3b;
> add $(16*8), %rsp;
> This implementation uses 8 loops, with 2 calls per iteration. This is
> marginally faster than a single call per iteration. We did not
> observe useful benefit (particularly relative to text size) from
> further unrolling. This may also be usefully split into smaller (e.g.
> 4 or 8 call) segments where we can usefully pipeline/intermix with
> other operations. It includes retpoline type traps so that if an
> entry is consumed, it cannot lead to controlled speculation. On my
> test system it took ~43 cycles on average. Note that non-zero
> displacement calls should be used as these may be optimized to not
> interact with the RSB due to their use in fetching RIP for 32-bit
> relocations.

Guidance from both Intel and AMD still states that 32 calls are required in general. The sequence above makes only 16 (8 iterations of 2 calls each). Is it optimised for a specific processor on which you know the RSB to be smaller?

~Andrew
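
For reference, a 32-call variant of the quoted sequence might look like the sketch below. It assumes that only the loop count and the final stack adjustment need to change. It is untested, and the numeric labels 31 and 41 are stand-ins for the 3p and 4p labels in the quoted code.

```asm
	mov	$16, %rax		/* assumed: 16 iterations x 2 calls = 32 RSB entries */
	.align	16
3:	call	4f			/* non-zero displacement call, fills one entry */
31:	pause				/* speculation trap (3p in the quoted code) */
	call	31b
	.align	16
4:	call	5f			/* second call of the iteration */
41:	pause				/* speculation trap (4p in the quoted code) */
	call	41b
	.align	16
5:	dec	%rax
	jnz	3b
	add	$(32*8), %rsp		/* discard the 32 return addresses pushed above */
```

The stack adjustment scales with the call count because each architecturally executed call pushes one 8-byte return address, so 32 calls leave 32*8 bytes to drop.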