On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote:
> On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote:
> >
> > Roman Kagan <rka...@virtuozzo.com> writes:
> >
> > > I came across the following AB-BA deadlock:
> > >
> > >     vCPU thread                             main thread
> > >     -----------                             -----------
> > > async_safe_run_on_cpu(self,
> > >                       async_synic_update)
> > > ...                                         [cpu hot-add]
> > > process_queued_cpu_work()
> > >   qemu_mutex_unlock_iothread()
> > >                                             [grab BQL]
> > >   start_exclusive()                         cpu_list_add()
> > >   async_synic_update()                        finish_safe_work()
> > >     qemu_mutex_lock_iothread()                  cpu_exec_start()
> > >
> > > ATM async_synic_update seems to be the only async safe work item that
> > > grabs BQL.  However it isn't quite obvious that it shouldn't; in the
> > > past there were more examples of this (e.g.
> > > memory_region_do_invalidate_mmio_ptr).
> > >
> > > It looks like the problem is generally in the lack of the nesting rule
> > > for cpu-exclusive sections against BQL, so I thought I would try to
> > > address that.  This patchset is my feeble attempt at this; I'm not sure
> > > I fully comprehend all the consequences (rather, I'm sure I don't) hence
> > > RFC.
> >
> > Hmm I think this is an area touched by:
> >
> >   Subject: [PATCH v7 00/73] per-CPU locks
> >   Date: Mon, 4 Mar 2019 13:17:00 -0500
> >   Message-Id: <20190304181813.8075-1-c...@braap.org>
> >
> > which has stalled on it's path into the tree. Last time I checked it
> > explicitly handled the concept of work that needed the BQL and work that
> > didn't.
>
> I'm still trying to get my head around that patchset, but it looks like
> it changes nothing in regards to cpu-exclusive sections and safe work,
> so it doesn't make the problem go.
>
> > How do you trigger your deadlock? Just hot-pluging CPUs?
>
> Yes.  The window is pretty narrow so I only saw it once although this
> test (where the vms are started and stopped and the cpus are plugged in
> and out) is in our test loop for quite a bit (probably 2+ years).
>
> Roman.

ping?
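
In case it helps review, the interleaving quoted above reduces to a wait-for cycle between two resources. Below is a minimal sketch of that cycle, not QEMU code: it assumes the cpu-exclusive section can be modelled as a plain lock-like resource (in QEMU it is not a mutex), and every name in it (`enum resource`, `struct actor`, `abba_deadlock`, `reported_case`, `work_without_bql`) is hypothetical and only there for illustration.

```c
#include <stdbool.h>

/* Toy model only: the two resources involved in the reported hang. */
enum resource { RES_NONE, RES_BQL, RES_EXCLUSIVE };

/* What a thread currently owns and what it is blocked waiting for. */
struct actor {
    enum resource holds;
    enum resource wants;
};

/* AB-BA: each thread waits for the resource the other one owns. */
bool abba_deadlock(struct actor a, struct actor b)
{
    return a.wants != RES_NONE && b.wants != RES_NONE &&
           a.wants == b.holds && b.wants == a.holds;
}

/* State of the reported interleaving once both threads are stuck. */
bool reported_case(void)
{
    /* vCPU thread: past start_exclusive(), the safe work item then
     * calls qemu_mutex_lock_iothread(). */
    struct actor vcpu = { RES_EXCLUSIVE, RES_BQL };
    /* main thread: holds BQL for the hot-add, and cpu_exec_start()
     * waits for the exclusive section to end. */
    struct actor main_thread = { RES_BQL, RES_EXCLUSIVE };

    return abba_deadlock(vcpu, main_thread);
}

/* Same main-thread state, but a safe work item that never takes BQL. */
bool work_without_bql(void)
{
    struct actor vcpu = { RES_EXCLUSIVE, RES_NONE };
    struct actor main_thread = { RES_BQL, RES_EXCLUSIVE };

    return abba_deadlock(vcpu, main_thread);
}
```

In this model `reported_case()` reports a cycle and `work_without_bql()` does not, which is consistent with async_synic_update apparently being the only safe work item that hits this today. It says nothing about which nesting order is the right one to enforce.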