* Michael Tokarev (m...@tls.msk.ru) wrote:
> Hi!
>
> I want to bring up a topic which is mostly neglected in qemu but is a
> very important one, especially for serious usage of qemu.
>
> This is about logging of various events, including unexpected events
> coming from the guest.
>
> Let's take a very easy-to-understand situation as an example: a virtual
> disk image of a VM runs out of space _on the host_, so that qemu does
> not have space to write newly allocated blocks to. What does it do?
> By default it pauses the VM and just sits there waiting. But does it
> let anyone know that it ran into something it can't handle? Does it
> write at least some information about it somewhere?
>
> I know this particular situation should not happen in any serious
> environment to start with, but everything happens, and it should ring
> a very loud bell somewhere so that people can react sooner rather than
> later, instead of after numerous attempts to _understand_ what's going
> on.
>
> What I'm talking about is that we should have some standard way of
> logging such events, at least, or a way of notifying something about
> them. Including, for example, start/stop, suspend/resume, migration
> out/migration in, etc. While these might look like a job for some
> management layer on top of qemu (like libvirt), it's still better to
> do it in qemu so that all possible tools using it will benefit.
>
> But it is not restricted to global events like those outlined above;
> there are much more interesting cases. For another example, there are
> a lot of places where qemu checks the validity of various requests
> coming to the virtual hardware it emulates - that DMA is done to a
> valid address, that a register holds a valid index, etc. The majority
> of these checks result in a simple error return or even abort().
> But it'd be very interesting to be able to log these events, maybe
> with some context, and to be able to turn that logging on/off at
> runtime.
>
> The thing is, there are at least two types of guests out there. One is
> the possibly "hostile" guest - this is what all the recent CVEs are
> about: it is possible to abuse qemu in one way or another by doing
> strange things within a guest. The other is entirely different - a
> "friendly" guest which is there just for easy management, so that
> various migrations, expansions and reconfigurations are possible.
> Such a guest should not do evil things, and if it does, we'd better
> know about it, as it is a possible bug which might turn into a
> smoking gun one day.
>
> The other day we had quite a bad system crash in a VMware environment,
> which resulted in massive data loss, interruption of service and lots
> of really hard work for everyone. The crash actually started after the
> vmware side discovered that it had no place to put a new write for a
> (production, "friendly") guest. After that issue was fixed, several
> more conditions _like_ this happened - not a lack of free space, but
> the same freezing: vmware stopped the guest for a few seconds several
> times, and I saw a brief change in the guest's icon color (turning
> red) in the monitor at least twice during these freezes. But there's
> nothing in the logs - *why* did it stop the guest, what happened? -
> nothing's left.
> And finally it crashed hard after live-migrating the VM to another
> host with larger-capacity storage. At the moment the migration
> finished, the VM froze for a few seconds; after that there were tons
> of kernel messages about stuck processes and kernel threads, and,
> among other things, a few error messages about failed writes to the
> (virtual) storage. These, in turn, hit some bugs in oracle; the db
> finally became corrupt and the oracle instance refused to open it in
> any way, crashing with sigsegvs at startup, so we had to resort to
> the harder way.
I'm glad it's not just me who has hard migration problems!

> The thing was that there was absolutely nothing on the vmware side;
> even the fact of the migration wasn't obvious in its logs. I bet there
> were at least some inconsistencies in the handling of requests from
> the guest, or if the system clocks were different that could have been
> logged too, or anything else not-so-expected. But there's nothing,
> so we still don't know what happened. Sure, it is a bug somewhere,
> maybe several bugs at once (vmware tools are running in the vm and the
> hypervisor surely can assume a safe state of the vm while doing the
> last steps of the migration, oracle should not sigsegv on bad blocks,
> and so on). But if there were logs, we'd have at least some chance to
> debug and maybe fix it, or to work around it if fixing isn't possible.
> But there's nothing.
>
> I think we can do much better in qemu.
>
> Comments?

We've got lots of things; the problem is deciding what to use when. We can:

 a) Send QMP events - that's pretty good for a 'hey I've stopped, disk
    full' (see the sketches at the end of this mail).
 b) Use the logging mechanism we've got, but I think it's rarely used.
 c) Use trace events - that's good for fine-grained debug, but not what
    you want on all the time.

For migration, if something fails, I always try to print to stderr so
that I get it in the /var/log/libvirt/qemu/... log; the problem is that
this only happens if you know you've screwed up - if you corrupt some
state during the migrate, then a hung guest is often the outcome.

Dave

> Thanks,
>
> /mjt

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
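
As an illustration of (a): a minimal sketch, in Python, of a QMP event
logger. It assumes qemu was started with something like
  -qmp unix:/tmp/qmp.sock,server,nowait
(the socket path is made up and error handling is omitted for brevity;
STOP, RESUME and BLOCK_IO_ERROR, though, are real QMP events).

#!/usr/bin/env python3
import json
import socket

SOCK_PATH = "/tmp/qmp.sock"   # hypothetical; match your -qmp option

def qmp_events(path):
    """Connect to the QMP socket and yield asynchronous events forever."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    f = sock.makefile("rw")
    json.loads(f.readline())    # greeting banner: {"QMP": {...}}
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())    # ack: {"return": {}}
    for line in f:
        msg = json.loads(line)
        if "event" in msg:      # e.g. STOP, RESUME, BLOCK_IO_ERROR
            yield msg

if __name__ == "__main__":
    for ev in qmp_events(SOCK_PATH):
        # A disk-full pause shows up as BLOCK_IO_ERROR followed by STOP,
        # so the ENOSPC scenario above is already visible here, provided
        # something is actually listening.
        print(ev["timestamp"]["seconds"], ev["event"], ev.get("data", {}))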
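
Events only help if something is connected when they fire; a management
tool that attaches late can instead ask qemu what state it is in. A
sketch along the same lines using the query-status command (the
"io-error" run state is what you'd expect to see after a disk-full
pause; same hypothetical socket path as above):

import json
import socket

def qmp_command(path, command):
    """Open a QMP session, run a single command, return its result."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    f = sock.makefile("rw")
    json.loads(f.readline())    # greeting: {"QMP": {...}}
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())    # ack: {"return": {}}
    f.write(json.dumps({"execute": command}) + "\n")
    f.flush()
    while True:
        msg = json.loads(f.readline())
        if "return" in msg:     # skip any interleaved events
            return msg["return"]

# After an ENOSPC pause this prints something like:
#   {'running': False, 'status': 'io-error'}
print(qmp_command("/tmp/qmp.sock", "query-status"))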