Hi On Mon, Aug 28, 2017 at 5:05 AM, Peter Xu <pet...@redhat.com> wrote: > On Fri, Aug 25, 2017 at 04:07:34PM +0000, Marc-André Lureau wrote: >> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilb...@redhat.com> >> wrote: >> >> > * Marc-André Lureau (marcandre.lur...@gmail.com) wrote: >> > > Hi >> > > >> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <pet...@redhat.com> wrote: >> > > >> > > > Firstly, introduce Monitor.use_thread, and set it for monitors that are >> > > > using non-mux typed backend chardev. We only do this for monitors, so >> > > > mux-typed chardevs are not suitable (when it connects to, e.g., serials >> > > > and the monitor together). >> > > > >> > > > When use_thread is set, we create standalone thread to poll the monitor >> > > > events, isolated from the main loop thread. Here we still need to take >> > > > the BQL before dispatching the tasks since some of the monitor commands >> > > > are not allowed to execute without the protection of BQL. Then this >> > > > gives us the chance to avoid taking the BQL for some monitor commands >> > in >> > > > the future. >> > > > >> > > > * Why this change? >> > > > >> > > > We need these per-monitor threads to make sure we can have at least one >> > > > monitor that will never stuck (that can receive further monitor >> > > > commands). >> > > > >> > > > * So when will monitors stuck? And, how do they stuck? >> > > > >> > > > After we have postcopy and remote page faults, it's simple to achieve a >> > > > stuck in the monitor (which is also a stuck in main loop thread): >> > > > >> > > > (1) Monitor deadlock on BQL >> > > > >> > > > As we may know, when postcopy is running on destination VM, the vcpu >> > > > threads can stuck merely any time as long as it tries to access an >> > > > uncopied guest page. Meanwhile, when the stuck happens, it is possible >> > > > that the vcpu thread is holding the BQL. If the page fault is not >> > > > handled quickly, you'll find that monitors stop working, which is >> > trying >> > > > to take the BQL. >> > > > >> > > > If the page fault cannot be handled correctly (one case is a paused >> > > > postcopy, when network is temporarily down), monitors will hang >> > > > forever. Without current patch, that means the main loop hanged. >> > We'll >> > > > never find a way to talk to VM again. >> > > > >> > > >> > > Could the BQL be pushed down to the monitor commands level instead? That >> > > way we wouldn't need a seperate thread to solve the hang on commands that >> > > do not need BQL. >> > >> > If the main thread is stuck though I don't see how that helps you; you >> > have to be able to run these commands on another thread. >> > >> >> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL >> and the command doesn't need it, it would work. In (2), info cpus >> shouldn't keep the BQL (my qapi-async series would probably help here) > > (Thanks for joining the discussion) > > AFAIK the main thread can be stuck for many reasons. I have seen one > stack when the VGA code (IIUC) was trying to writting to guest graphic > memory in main loop thread but luckily that guest page is still not > copied yet from source. As long as the main thread is stuck for any > reason, no chance for monitor commands, even if the commands support > async operations.
If that command becomes async (it probably should, any command doing IO probaly should), then the main loop can keep running. > > So IMHO the only solution is doing these things in separate threads, > rather than all in a single one. I wouldn't say it's the only solution. I think the monitor can touch many areas that haven't been written with multi-threading in mind. My proposal is probably safer, although I don't know how hard it would be to push the BQL down to QMP commands, and make async existing IO commands. The benefits of this work are quite interesting imho, because a stuck mainloop is basically a stuck qemu, and an additional thread will not solve it... -- Marc-André Lureau