* Michael Tokarev (m...@tls.msk.ru) wrote:
> Hi!
>
> I want to bring up a topic which is mostly neglected in qemu but is a
> very important one, especially for serious usage of qemu.
>
> This is about logging of various events, including unexpected events
> coming from the guest.
>
> Let's take a very easy-to-understand situation as an example: a virtual
> disk image of a VM runs out of space _on the host_, so that qemu does
> not have space to write newly allocated blocks to. What does it do?
> By default it pauses the VM and just sits there waiting. But does it
> let anyone know that it ran into something it can't handle? Does it
> write at least some information about it somewhere?
>
> I know this particular situation should not happen in any serious
> environment to start with, but everything happens, and it should ring
> a very loud bell somewhere so that people can react sooner rather than
> later, instead of after numerous attempts to _understand_ what's going
> on.
>
> What I'm talking about is that we should have some standard way of
> logging such events, at least, or a way of notifying something about
> them. Including, for example, start/stop, suspend/resume, migration
> out/migration in, etc. While these might look like a job for some
> management layer on top of qemu (like libvirt), it's still better to
> do it in qemu so that all possible tools using it will benefit.
>
> But it is not restricted to global events like those outlined above;
> there are much more interesting cases. For another example, there are
> a lot of places where qemu checks the validity of various requests
> coming to the virtual hardware it emulates - that DMA is done to a
> valid address, that a register holds a valid index, etc. The majority
> of these checks result in a simple error return or even abort().
> But it'd be very interesting to be able to log these events, maybe
> with some context, and to be able to turn that logging on/off at
> runtime.
>
> The thing is, there are at least two types of guests out there. One is
> the possibly "hostile" guest - this is what all the recent CVEs are
> about: it is possible to abuse qemu in one way or another by doing
> strange things within a guest. The other is entirely different - a
> "friendly" guest which is there just for easy management, so that
> various migrations, expansions and reconfigurations are possible.
> Such a guest should not do evil things, and if it does, we'd better
> know about it, as it is a possible bug which might turn into a
> smoking gun one day.
>
> The other day we had quite a bad system crash in a VMware environment,
> which resulted in massive data loss, interruption of service and lots
> of really hard work for everyone. The crash actually started after the
> vmware side discovered that it had no place to put a new write for a
> (production, "friendly") guest. After that issue was fixed, several
> more conditions _like_ this happened - not a lack of free space, but
> the same freezing: vmware stopped the guest for a few seconds several
> times, and I saw a brief change in the guest's icon color (turning
> red) in the monitor at least twice during these freezes. But there's
> nothing in the logs - *why* did it stop the guest, what happened? -
> nothing's left.
> And finally it crashed hard after live-migrating the VM to another
> host with larger-capacity storage. At the moment the migration
> finished, the VM froze for a few seconds; after that there were tons
> of kernel messages about stuck processes and kernel threads, and,
> among other things, a few error messages about failed writes to the
> (virtual) storage. These, in turn, hit some bugs in oracle; the db
> finally became corrupt and the oracle instance refused to open it in
> any way, crashing with sigsegvs at startup, so we had to resort to
> the harder way.
I'm glad it's not just me who has hard migration problems!

> The thing was that there was absolutely nothing on the vmware side;
> even the fact of the migration wasn't obvious in its logs. I bet there
> were at least some inconsistencies in the handling of requests from
> the guest, or if the system clocks were different that could have been
> logged too, or anything else not-so-expected. But there's nothing,
> so we still don't know what happened. Sure, it is a bug somewhere,
> maybe several bugs at once (vmware tools are running in the vm and the
> hypervisor surely can assume a safe state of the vm while doing the
> last steps of the migration, oracle should not sigsegv on bad blocks,
> and so on). But if there were logs, we'd have at least some chance to
> debug and maybe fix it, or to work around it if fixing isn't possible.
> But there's nothing.
>
> I think we can do much better in qemu.
>
> Comments?

We've got lots of things; the problem is deciding what to use when. We can:

 a) Send QMP events - that's pretty good for a 'hey I've stopped, disk
    full' (see the sketches at the end of this mail).
 b) Use the logging mechanism we've got, but I think it's rarely used.
 c) Use trace events - that's good for fine-grained debug, but not what
    you want on all the time.

For migration, if something fails, I always try to print to stderr so
that I get it in the /var/log/libvirt/qemu/... log; the problem is that
this only happens if you know you've screwed up - if you corrupt some
state during the migrate, then a hung guest is often the outcome.

Dave

> Thanks,
>
> /mjt

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
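
As an illustration of (a): a minimal sketch, in Python, of a QMP event
logger. It assumes qemu was started with something like
  -qmp unix:/tmp/qmp.sock,server,nowait
(the socket path is made up and error handling is omitted for brevity;
STOP, RESUME and BLOCK_IO_ERROR, though, are real QMP events).

#!/usr/bin/env python3
import json
import socket

SOCK_PATH = "/tmp/qmp.sock"   # hypothetical; match your -qmp option

def qmp_events(path):
    """Connect to the QMP socket and yield asynchronous events forever."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    f = sock.makefile("rw")
    json.loads(f.readline())    # greeting banner: {"QMP": {...}}
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())    # ack: {"return": {}}
    for line in f:
        msg = json.loads(line)
        if "event" in msg:      # e.g. STOP, RESUME, BLOCK_IO_ERROR
            yield msg

if __name__ == "__main__":
    for ev in qmp_events(SOCK_PATH):
        # A disk-full pause shows up as BLOCK_IO_ERROR followed by STOP,
        # so the ENOSPC scenario above is already visible here, provided
        # something is actually listening.
        print(ev["timestamp"]["seconds"], ev["event"], ev.get("data", {}))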
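
Events only help if something is connected when they fire; a management
tool that attaches late can instead ask qemu what state it is in. A
sketch along the same lines using the query-status command (the
"io-error" run state is what you'd expect to see after a disk-full
pause; same hypothetical socket path as above):

import json
import socket

def qmp_command(path, command):
    """Open a QMP session, run a single command, return its result."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    f = sock.makefile("rw")
    json.loads(f.readline())    # greeting: {"QMP": {...}}
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())    # ack: {"return": {}}
    f.write(json.dumps({"execute": command}) + "\n")
    f.flush()
    while True:
        msg = json.loads(f.readline())
        if "return" in msg:     # skip any interleaved events
            return msg["return"]

# After an ENOSPC pause this prints something like:
#   {'running': False, 'status': 'io-error'}
print(qmp_command("/tmp/qmp.sock", "query-status"))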