On 12/11/2021 10:43, Jan Beulich wrote:
> On 11.11.2021 18:57, Andrew Cooper wrote:
>> Function pointers are expensive, and the raw parameter is a constant from all
>> callers, meaning that it predicts very well with local branch history.
>
> The code change is fine, but I'm having trouble with "all" here: Both
> functions aren't even static, so while callers in io_apic.c may
> benefit (perhaps with the exception of ioapic_{read,write}_entry(),
> depending on whether the compiler views inlining them as warranted),
> I'm in no way convinced this extends to the callers in VT-d code.
>
> Further ISTR clang being quite a bit less aggressive about inlining,
> so the effects might not be quite as good there even for the call
> sites in io_apic.c.
>
> Can you clarify this for me please?
The way the compiler lays out the code is unrelated to why this form is
an improvement.
Branch history is a function of "the $N most recently taken branches".
This is because "how you got here" is typically relevant to "where you
should go next".
Trivial schemes maintain a shift register of taken / not-taken results.
Less trivial schemes maintain a rolling hash of (src addr, dst addr)
tuples of all taken branches (direct and indirect). In both cases, the
instantaneous branch history is an input into the final prediction, and
is commonly used to select which saturating counter (or bank of
counters) is used.
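
As a minimal sketch of the trivial scheme, assuming a 4-bit history and
2-bit counters (the struct, function names and sizes are illustrative
assumptions, not any real CPU's design), the shift register directly
selects which saturating counter makes the prediction:

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS   4                    /* assumed history length */
#define NR_COUNTERS (1u << HIST_BITS)

struct predictor {
    uint32_t history;                /* shift register of taken/not-taken */
    uint8_t  counter[NR_COUNTERS];   /* 2-bit saturating counters, 0..3 */
};

/* Predict "taken" when the counter selected by the history is 2 or 3. */
static bool predict(const struct predictor *p)
{
    return p->counter[p->history & (NR_COUNTERS - 1)] >= 2;
}

/* Train the selected counter, then shift the outcome into the history. */
static void update(struct predictor *p, bool taken)
{
    uint8_t *c = &p->counter[p->history & (NR_COUNTERS - 1)];

    if ( taken )
    {
        if ( *c < 3 )
            ++*c;
    }
    else if ( *c > 0 )
        --*c;

    p->history = (p->history << 1) | taken;
}
```

A strictly alternating taken/not-taken pattern ends up using two
different history values, and therefore two different counters, so it
predicts correctly once those counters have warmed up.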
Consider something like

    while ( cond )
    {
        memcpy(dst1, src1, 64);
        memcpy(dst2, src2, 7);
    }

Here, the conditional jump inside memcpy() coping with the tail of the
copy flips its result 50% of the time, which is fiendish to predict.
However, because the branch history differs (by memcpy()'s return
address which was accumulated by the call instruction), the predictor
can actually use two different taken/not-taken counters for the two
different "instances" of the tail jump. After a few iterations to warm
up, the predictor will get every jump perfect despite the fact that
memcpy() is a library call and the branches would otherwise alias.
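
A toy model of that loop, as a sketch only: it assumes the history can be
reduced to "which call site did we come from", and that the tail branch
is not taken for the 64-byte copy and taken for the 7-byte copy. The
function names and the starting counter values are made up for
illustration.

```c
#include <stdbool.h>

/* Step a 2-bit saturating counter (0..3) towards the observed outcome. */
static void train(unsigned char *ctr, bool taken)
{
    if ( taken )
    {
        if ( *ctr < 3 )
            ++*ctr;
    }
    else if ( *ctr > 0 )
        --*ctr;
}

/*
 * Count mispredictions of the tail branch over 'iters' loop iterations.
 * Hypothetical model: call site 0 copies 64 bytes (tail branch not taken),
 * call site 1 copies 7 bytes (tail branch taken).
 */
static unsigned int tail_mispredicts(bool use_history, unsigned int iters)
{
    unsigned char ctr[2] = { 1, 1 };   /* both start weakly not-taken */
    unsigned int i, site, misses = 0;

    for ( i = 0; i < iters; i++ )
        for ( site = 0; site < 2; site++ )
        {
            bool taken = (site == 1);
            /* Without history, both call sites alias onto one counter. */
            unsigned char *c = &ctr[use_history ? site : 0];

            if ( (*c >= 2) != taken )
                misses++;
            train(c, taken);
        }

    return misses;
}
```

In this model, tail_mispredicts(false, 100) returns 100 (half of the 200
predictions are wrong), while tail_mispredicts(true, 100) returns 1 (only
the first taken instance is missed during warm-up).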
Bringing it back to the code in question: the "raw" parameter is an
explicit true or false at the top of all call paths leading into these
functions. Therefore, an individual branch history has a high
correlation with said true or false, irrespective of the absolute code
layout. As a consequence, the correct result of the prediction is
highly correlated with the branch history, and it will predict
perfectly[1] after a few times the path has been used.
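
For illustration, the two forms being compared look roughly like this.
This is a simplified sketch, not the actual io_apic.c code: read_raw(),
read_cooked() and both read_entry_*() helpers are hypothetical stand-ins.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two register accessors. */
static unsigned int read_raw(unsigned int reg)    { return reg; }
static unsigned int read_cooked(unsigned int reg) { return reg + 0x100; }

/* Function pointer form: select a pointer, then make an indirect call. */
static unsigned int read_entry_ptr(unsigned int reg, bool raw)
{
    unsigned int (*read)(unsigned int) = raw ? read_raw : read_cooked;

    return read(reg);
}

/* Branch form: one conditional branch on 'raw' and two direct calls. */
static unsigned int read_entry_branch(unsigned int reg, bool raw)
{
    if ( raw )
        return read_raw(reg);

    return read_cooked(reg);
}
```

Both forms return the same value for the same arguments. The difference
is that the second has a conditional branch on "raw" rather than an
indirect call, and it is that conditional branch which correlates with
the history accumulated on the way in.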
~Andrew
[1] Obviously, it's not actually perfect outside of a synthetic
example. Aliasing in the predictor is a necessary property of keeping
the logic small enough to provide an answer fast, but the less
accidental aliasing there is, the faster the CPU performance in
benchmarks, so incentives are in our favour here.