Aaron Lindsay <aa...@os.amperecomputing.com> writes:
> Hello,
>
> I have been wrestling with what might be a bug in the plugin memory
> callbacks. The immediate error is that I hit the
> `g_assert_not_reached()` in the 'default:' case in
> qemu_plugin_vcpu_mem_cb, indicating the callback type was invalid. When
> breaking on this assertion in gdb, the contents of cpu->plugin_mem_cbs
> are obviously bogus (`len` was absurdly high, for example). After doing
> some further digging/instrumenting, I eventually found that
> `free_dyn_cb_arr(void *p, ...)` is being called shortly before the
> assertion is hit with `p` pointing to the same address as
> `cpu->plugin_mem_cbs` will later hold at assertion-time. We are freeing
> the memory still pointed to by `cpu->plugin_mem_cbs`.
>
> I believe the code *should* always reset `cpu->plugin_mem_cbs` to NULL at the
> end of an instruction/TB's execution, so its not exactly clear to me how this
> is occurring. However, I suspect it may be relevant that we are calling
> `free_dyn_cb_arr()` because my plugin called `qemu_plugin_reset()`.

Hmm, I'm going to have to remind myself about how this bit works.

>
> I have additionally found that the below addition allows me to run
> successfully without hitting the assert:
>
> diff --git a/plugins/core.c b/plugins/core.c
> --- a/plugins/core.c
> +++ b/plugins/core.c
> @@ -427,9 +427,14 @@ static bool free_dyn_cb_arr(void *p, uint32_t h, void *userp)
>
>  void qemu_plugin_flush_cb(void)
>  {
> +    CPUState *cpu;
>      qht_iter_remove(&plugin.dyn_cb_arr_ht, free_dyn_cb_arr, NULL);
>      qht_reset(&plugin.dyn_cb_arr_ht);
>
> +    CPU_FOREACH(cpu) {
> +        cpu->plugin_mem_cbs = NULL;
> +    }
> +

This is essentially qemu_plugin_disable_mem_helpers() but for all CPUs.
I think we should be able to treat the CPUs separately.

>      plugin_cb__simple(QEMU_PLUGIN_EV_FLUSH);
>  }
>
> Unfortunately, the workload/setup I have encountered this bug with are
> difficult to reproduce in a way suitable for sharing upstream (admittedly
> potentially because I do not fully understand the conditions necessary to
> trigger it). It is also deep into a run

How many full TB flushes have there been? You only see
qemu_plugin_flush_cb when we flush the whole translation buffer (which
is something we do more often when plugins exit).

Does lowering tb-size make it easier to hit the failure mode?

> , and I haven't found a good way
> to break in gdb immediately prior to it happening in order to inspect
> it, without perturbing it enough such that it doesn't happen...

This is exactly the sort of thing rr is great for. Can you trigger it in
that?

  https://rr-project.org/

>
> I welcome any feedback or insights on how to further nail down the
> failure case and/or help in working towards an appropriate solution.
>
> Thanks!
>
> -Aaron

--
Alex Bennée