On 6/23/24 14:27, Alexander Monakov wrote:
> Hello,
>
> On Wed, 12 Jun 2024, Paolo Bonzini wrote:
>
>> I didn't do this because of RHEL9, I did it because it's silly that
>> QEMU cannot use POPCNT and has to waste 2% of the L1 d-cache to
>> compute the x86 parity flag (and POPCNT was introduced at the same
>> time as SSE4.2).
>
> I do not see where the 2% figure is coming from: even considering that
> the 256-byte LUT may take an extra cache line due to misalignment, 320
> bytes is still less than 1% of 32KB L1D size.
>
> More importantly, the way this comment is phrased made me think that Qemu
> eagerly computes PF. But the comment in target/i386/cpu.h is saying that
> all flags are computed in an on-demand manner. Considering that software
> pretty much never uses PF, why would the parity table be resident in L1D?
> As far as I can see, the cost is rather a cache miss and perhaps a TLB miss
> when PF is computed (mostly when EFLAGS are accessed all together on
> context switches I think).
>
> Is there something I'm not seeing?

We delay flags computation until they're needed (since flags are often overwritten by the very next instruction), but when we do, we compute all of the flags. So PF is computed at that point, even if PF itself will never be read.
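
To illustrate the scheme described above, here is a minimal sketch in C. It is not QEMU's actual code: the names LazyFlags, emulate_add, parity_flag and compute_all_add are hypothetical, only a 32-bit ADD is modelled, and AF and OF are left out for brevity. The point it shows is that the flag-setting instruction only records its operands, and that once any flag is requested the whole set, PF included, gets computed.

```c
#include <stdint.h>

/* x86 EFLAGS bit positions for the flags modelled in this sketch. */
#define CC_C 0x0001u  /* carry  */
#define CC_P 0x0004u  /* parity */
#define CC_Z 0x0040u  /* zero   */
#define CC_S 0x0080u  /* sign   */

/* Hypothetical lazy-flags state: just enough to recompute flags later. */
typedef struct {
    uint32_t cc_dst;  /* result of the last flag-setting operation */
    uint32_t cc_src;  /* its second operand */
} LazyFlags;

/* Emulate ADD: store the operands, compute no flags yet. */
static uint32_t emulate_add(LazyFlags *f, uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    f->cc_dst = r;
    f->cc_src = b;
    return r;
}

/*
 * PF is set when the low byte of the result has an even number of 1 bits.
 * This folds the byte with XORs; a 256-entry table or a POPCNT instruction
 * are alternative ways to get the same answer.
 */
static uint32_t parity_flag(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1) ? 0 : CC_P;
}

/* Called only when some flag is needed; computes all of them at once. */
static uint32_t compute_all_add(const LazyFlags *f)
{
    uint32_t r = f->cc_dst;
    uint32_t flags = 0;

    if (r < f->cc_src) {
        flags |= CC_C;            /* unsigned wrap-around means carry out */
    }
    flags |= parity_flag((uint8_t)r);
    if (r == 0) {
        flags |= CC_Z;
    }
    if (r & 0x80000000u) {
        flags |= CC_S;
    }
    return flags;
}
```

In this sketch a guest that only tests ZF after an ADD would still go through parity_flag(), because compute_all_add() has no way of knowing which bits the caller will look at.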


r~
