On Tue, Oct 23, 2018 at 08:02:42 +0100, Richard Henderson wrote:
> The motivation here is reducing the total overhead.
>
> Before a few patches went into target-arm.next, I measured total
> tlb flush overhead for aarch64 at 25%. This appears to reduce the
> total overhead to about 5% (I do need to re-run the control tests,
> not just watch perf top as I'm doing now).
I'd like to see those absolute perf numbers; I ran a few Ubuntu aarch64
boots and the noise is just too high to draw any conclusions (I'm using
your tlb-dirty branch on github). When booting the much smaller debian
image, these patches are performance-neutral though. So,

  Reviewed-by: Emilio G. Cota <c...@braap.org>

for the series.

(On a pedantic note: consider s/miniscule/minuscule/ in patches 6-7.)

> The final patch is somewhat of an RFC. I'd like to know what
> benchmark was used when putting in pending_tlb_flushes, and I
> have not done any archaeology to find out. I suspect that it
> does not make any measurable difference beyond tlb_c.dirty, and I
> think the code is a bit cleaner without it.

I suspect that pending_tlb_flushes was a premature optimization. Avoiding
the async job sounds like a good idea, since it is very expensive for the
remote vCPU. However, in most cases we'll take the lock (or a full barrier
in the original code) yet still not avoid the async job, because a race
when flushing other vCPUs is unlikely; we therefore just waste cycles in
the lock (formerly the barrier). A rough sketch of the pattern I mean is
appended below.

Thanks,

		Emilio
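P.S. Here is the sketch; it is hypothetical, not the actual cputlb.c code
(VCPU, pending_flush and queue_async_flush() are made-up names), and only
illustrates the cost structure: every cross-vCPU flush pays for the lock
(or barrier), while the "flush already pending" case that would let us
skip the async job is rare.

/*
 * Hypothetical sketch, not the actual QEMU code: names and types are
 * invented to illustrate the pending-flush pattern discussed above.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct VCPU {
    pthread_mutex_t lock;    /* formerly a full barrier + atomic RMW */
    uint16_t pending_flush;  /* MMU indexes with a flush already queued */
} VCPU;

/* Stand-in for scheduling async work on the remote vCPU (expensive). */
static void queue_async_flush(VCPU *cpu, uint16_t idxmap)
{
    (void)cpu;
    (void)idxmap;
}

static void flush_other_vcpu(VCPU *cpu, uint16_t idxmap)
{
    bool already_pending;

    /* Every cross-vCPU flush pays for this lock (or barrier)... */
    pthread_mutex_lock(&cpu->lock);
    already_pending = (cpu->pending_flush & idxmap) == idxmap;
    cpu->pending_flush |= idxmap;
    pthread_mutex_unlock(&cpu->lock);

    /*
     * ...but a flush for the same indexes is rarely already in flight,
     * so we almost always queue the async job anyway; the common case
     * just burns the cycles spent on the lock.
     */
    if (!already_pending) {
        queue_async_flush(cpu, idxmap);
    }
}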