Hi Andrea,

On Thu, Feb 8, 2024 at 12:42 PM Andrea Parri <parri.and...@gmail.com> wrote:
>
> > +static int __ftrace_modify_code(void *data)
> > +{
> > +	struct ftrace_modify_param *param = data;
> > +
> > +	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> > +		ftrace_modify_all_code(param->command);
> > +		atomic_inc(&param->cpu_count);
>
> I stared at ftrace_modify_all_code() for a bit but honestly I don't see
> what prevents the ->cpu_count increment from being reordered before the
> insn write(s) (architecturally) now that you have removed the IPI dance:
> perhaps add an smp_wmb() right before the atomic_inc() (or promote this
> latter to a (void)atomic_inc_return_release()) and/or an inline comment
> saying why such reordering is not possible?

I did not even think of that, and it actually makes sense, so I'll go
with what you propose: I'll replace atomic_inc() with
atomic_inc_return_release(), and I'll add the following comment if
that's ok with you:

"Make sure the patching store is effective *before* we increment the
counter which releases all waiting cpus"
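
That is, roughly the following (untested sketch; I left the waiter
branch untouched here, the smp_mb() is discussed below):

static int __ftrace_modify_code(void *data)
{
	struct ftrace_modify_param *param = data;

	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
		ftrace_modify_all_code(param->command);
		/*
		 * Make sure the patching store is effective *before* we
		 * increment the counter which releases all waiting cpus.
		 */
		(void)atomic_inc_return_release(&param->cpu_count);
	} else {
		while (atomic_read(&param->cpu_count) <= num_online_cpus())
			cpu_relax();
		smp_mb();
	}

	local_flush_icache_all();

	return 0;
}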
>
> > +	} else {
> > +		while (atomic_read(&param->cpu_count) <= num_online_cpus())
> > +			cpu_relax();
> > +		smp_mb();
>
> I see that you've lifted/copied the memory barrier from patch_text_cb():
> what's its point?  AFAIU, the barrier has no ordering effect on program
> order later insn fetches; perhaps the code was based on some legacy/old
> version of Zifencei?  IAC, comments, comments, ...  or maybe just remove
> that memory barrier?

Honestly, I looked at it for a minute, did not understand its purpose,
and told myself "ok, that can't hurt anyway, I may be missing
something". FWIW, I see that arm64 uses isb() here. If you don't see
its purpose either, I'll remove it (here and in the other places I
copied it to).

>
> > +	}
> > +
> > +	local_flush_icache_all();
> > +
> > +	return 0;
> > +}

[...]

> > @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
> >  	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
> >  		for (i = 0; ret == 0 && i < patch->ninsns; i++) {
> >  			len = GET_INSN_LENGTH(patch->insns[i]);
> > -			ret = patch_text_nosync(patch->addr + i * len,
> > -						&patch->insns[i], len);
> > +			ret = patch_insn_write(patch->addr + i * len,
> > +					       &patch->insns[i], len);
> >  		}
> >  		atomic_inc(&patch->cpu_count);
> >  	} else {
> > @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
> >  		smp_mb();
> >  	}
> >
> > +	local_flush_icache_all();
> > +
> >  	return ret;
> >  }
> >  NOKPROBE_SYMBOL(patch_text_cb);
>
> My above remarks/questions also apply to this function.
>
>
> On a last topic, although somehow orthogonal to the scope of this patch,
> I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
> correct: I can see why we may want (need to do) the local TLB flush be-
> fore returning from patch_{map,unmap}(), but does a local flush suffice?
> For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
> sequence in their unmapping stage (and apparently relying on "no caching
> of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
> (riscv's) TLB invalidations will necessarily introduce some complexity...
>
> Thoughts?

To avoid remote TLBIs, could we simply disable preemption before the
first patch_map()? arm64 disables irqs, which seems overkill to me, but
maybe I'm missing something again (rough sketch at the end of this
email).

Thanks for your comments Andrea,

Alex

>
>     Andrea
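
P.S. To make the "disable preemption" idea above a bit more concrete,
here is the rough shape I have in mind for patch_insn_write() (untested
sketch, most of the body elided, and I'm writing the prototype from
memory):

static int patch_insn_write(void *addr, const void *insn, size_t len)
{
	...
	/*
	 * Disabling preemption keeps the temporary mapping created by
	 * patch_map() local to this cpu, so the local TLB flush done when
	 * tearing it down in patch_unmap() should be enough: no remote
	 * TLBI needed.
	 */
	preempt_disable();

	waddr = patch_map(addr, ...);
	...
	patch_unmap(...);

	preempt_enable();
	...
}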