Excerpts from Richard Henderson's message of May 20, 2022 4:31 am: > On 5/19/22 07:11, Nicholas Piggin wrote: >> dcache writeback and icache invalidate is not required when icache is >> coherent, a shorter fixed-length sequence can be used which just has to >> flush and re-fetch instructions that were in-flight. >> >> Signed-off-by: Nicholas Piggin <npig...@gmail.com> >> --- >> >> I haven't been able to measure a significant performance difference >> with this, qemu isn't flushing large ranges frequently so the old sequence >> is not that slow. > > Yeah, we should be flushing smallish regions (< 1-4k), as we generate > TranslationBlocks. > And hopefully the translation cache is large enough that we spend more time > executing > blocks than re-compiling them. ;-) > > >> +++ b/include/qemu/cacheflush.h >> @@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx, >> uintptr_t rw, size_t len) >> >> #else >> >> +#if defined(__powerpc__) >> +extern bool have_coherent_icache; >> +#endif > > Ug. I'm undecided where to put this. I'm tempted to say... > >> --- a/util/cacheflush.c >> +++ b/util/cacheflush.c >> @@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw, >> size_t len) > > ... here in cacheflush.c, with a comment that the variable is defined and > initialized in > cacheinfo.c. > > I'm even more tempted to merge the two files to put all of the > machine-specific cache data > in the same place, then this variable can be static. There's even an > existing TODO > comment in cacheflush.c for aarch64.
That might be nice. Do you want me to look at doing that first? >> b = rw & ~(dsize - 1); >> + >> + if (have_coherent_icache) { >> + asm volatile ("sync" : : : "memory"); >> + asm volatile ("icbi 0,%0" : : "r"(b) : "memory"); >> + asm volatile ("isync" : : : "memory"); >> + return; >> + } > > Where can I find definitive rules on this? In processor manuals (I don't know if there are any notes about this in the ISA, I would be tempted to say there should be since many processors implement it). POWER9 UM, 4.6.2.2 Instruction Cache Block Invalidate (icbi) https://ibm.ent.box.com/s/tmklq90ze7aj8f4n32er1mu3sy9u8k3k > Note that rx may not equal rw, and that we've got two virtual mappings for > the same > memory, one for "data" that is read-write and one for "execute" that is > read-execute. > (This split is enabled only for --enable-debug-tcg builds on linux, to make > sure we don't > regress apple m1, which requires the split all of the time.) > > In particular, you're flushing one icache line with the dcache address, and > that you're > not flushing any of the other lines. Is the coherent icache thing really > that we may > simply skip the dcache flush step, but must still flush all of the icache > lines? Yeah it's just a funny sequence the processor implements. It treats icbi almost as a no-op except that it sets a flag such that the next isync will flush and refetch the pipeline. It doesn't do any cache flushing. > Without docs, "icache snoop" to me would imply that we only need the two > barriers and no > flushes at all, just to make sure all memory writes complete before any new > instructions > are executed. This would be like the two AArch64 bits, IDC and DIC, which > indicate that > the two caches are coherent to Point of Unification, which leaves us with > just the > Instruction Sequence Barrier at the end of the function. > > >> +bool have_coherent_icache = false; > > scripts/checkpatch.pl should complain this is initialized to 0. > > >> static void arch_cache_info(int *isize, int *dsize) >> { >> +# ifdef PPC_FEATURE_ICACHE_SNOOP >> + unsigned long hwcap = qemu_getauxval(AT_HWCAP); >> +# endif >> + >> if (*isize == 0) { >> *isize = qemu_getauxval(AT_ICACHEBSIZE); >> } >> if (*dsize == 0) { >> *dsize = qemu_getauxval(AT_DCACHEBSIZE); >> } >> + >> +# ifdef PPC_FEATURE_ICACHE_SNOOP >> + have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0; >> +# endif > > Better with only one ifdef, moving this second hunk up. Will clean those bits up, thanks. > It would be nice if there were some kernel documentation for this... arm64 has kernel docs for hwcaps... powerpc probably should as well. Good point, I might do a patch for that. Thanks, Nick