Re: [Qemu-devel] Error handling in realize() methods

Markus Armbruster Thu, 10 Dec 2015 01:29:08 -0800

"Dr. David Alan Gilbert" <dgilb...@redhat.com> writes:

> * Markus Armbruster (arm...@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilb...@redhat.com> writes:
>> 
>> > * Markus Armbruster (arm...@redhat.com) wrote:
>> >> In general, code running withing a realize() method should not exit() on
>> >> error.  Instad, errors should be propagated through the realize()
>> >> method.  Additionally, the realize() method should fail cleanly,
>> >> i.e. carefully undo its side effects such as wiring of interrupts,
>> >> mapping of memory, and so forth.  Tedious work, but necessary to make
>> >> hot plug safe.
>> [...]
>> >> Next, let's consider the special case "out of memory".
>> >> 
>> >> Our general approach is to treat it as immediately fatal.  This makes
>> >> sense, because when a smallish allocation fails, the process is almost
>> >> certainly doomed anyway.  Moreover, memory allocation is frequent, and
>> >> attempting to recover from failed memory allocation adds loads of
>> >> hard-to-test error paths.  These are *dangerous* unless carefully tested
>> >> (and we don't).
>> >> 
>> >> Certain important allocations we handle more gracefully.  For instance,
>> >> we don't want to die when the user tries to hot-plug more memory than we
>> >> can allocate, or tries to open a QCOW2 image with a huge L1 table.
>> >> 
>> >> Guest memory allocation used to have the "immediately fatal" policy
>> >> baked in at a fairly low level, but it's since been lifted into callers;
>> >> see commit c261d77..fc7a580 and fixups 4f96676..0bdaa3a.  During review
>> >> of the latter, Peter Crosthwaite called out the &error_fatal in the
>> >> realize methods and their supporting code.  I agreed with him back then
>> >> that the errors should really be propagated.  But now I've changed my
>> >> mind: I think we should treat these memory allocation failures like
>> >> we've always treated them, namely report and exit(1).  Except for
>> >> "large" allocations, where we have a higher probability of failure, and
>> >> a more realistic chance to recover safely.
>> >> 
>> >> Can we agree that passing &error_fatal to memory_region_init_ram() &
>> >> friends is basically okay even in realize() methods and their supporting
>> >> code?
>> >
>> > I'd say it depends if they can be hotplugged; I think anything
>> > that we really
>> > want to hotplug onto real running machines (as opposed to where we're just
>> > hotplugging during setup) we should propagate errors properly.
>> >
>> > And tbh I don't buy the small allocation argument; I think we
>> > should handle them
>> > all; in my utopian world a guest wouldn't die unless there was no way out.
>> 
>> I guess in Utopia nobody ever makes stupid coding mistakes, the error
>> paths are all covered by a comprehensive test suite, and they make the
>> code prettier, too.  Oh, and kids always eat their vegetables without
>> complaint.
>
> Yes, it's lovely.
>
>> However, we don't actually live in Utopia.  In our world, error paths
>> clutter the code, are full of bugs, and the histogram of their execution
>> counts in testing (automated or not) has a frighteningly tall bar at
>> zero.
>> 
>> We're not going to make this problem less severe by making it bigger.
>> In fact, we consciously decided to hack off a big chunk with an axe:
>> 
>> commit 8a1d02aba9f986ca03d854184cd432ee98bcd179
>> Author: aliguori <aliguori@c046a42c-6fe2-441c-8c8c-71466251a162>
>> Date:   Thu Feb 5 22:05:49 2009 +0000
>> 
>>     Terminate emulation on memory allocation failure (Avi Kivity)
>>     
>>     Memory allocation failures are a very rare condition on virtual-memory
>>     hosts.  They are also very difficult to handle correctly (especially in a
>>     hardware emulation context).  Because of this, it is better to gracefully
>>     terminate emulation rather than executing untested or even unwritten 
>> recovery
>>     code paths.
>>     
>>     This patch changes the qemu memory allocation routines to terminate 
>> emulation
>>     if an allocation failure is encountered.
>>     
>>     Signed-off-by: Avi Kivity <a...@redhat.com>
>>     Signed-off-by: Anthony Liguori <aligu...@us.ibm.com>
>>     
>>     
>>     git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@6526 
>> c046a42c-6fe2-441c-8c8c-71466251a162
>> 
>> Let me elaborate a bit on Avi's arguments:
>> 
>> * Memory allocations are very, very common.  I count at least 2500,
>>   Memory allocation failure is easily the most common *potential* error,
>>   both statically and dynamically.
>> 
>> * Error recovery is always tedious and often hard.  Especially when the
>>   error happens in the middle of a complex operation that needs to be
>>   fully rolled back to recover.  A sensible approach is to acquire
>>   resources early, when recovery is still relatively easy, but that's
>>   often impractical for memory.  This together with the ubiquitousness
>>   of memory allocation makes memory allocation failure even harder to
>>   handle than other kinds of errors.
>> 
>> * Not all allocations are equal.  When an attempt to allocate a gigabyte
>>   fails gracefully, there's a good chance that ordinary code can go on
>>   allocating and freeing kilobytes as usual.  But when an attempt to
>>   allocate kilobytes fails, it's very likely that handling this failure
>>   gracefully will only lead to another one, and another one, until some
>>   buggy error handling puts the doomed process out of its misery.
>> 
>>   Out of the ~2400 memory allocations that go through GLib, 59 can fail.
>>   The others all terminate the process.
>> 
>> * How often do we see failures from these other 2300+?  Bug reports from
>>   users?  As far as I can see, they're vanishingly rare.
>> 
>> * Reviewing and testing the error handling chains rooted at 59
>>   allocations is hard enough, and I don't think we're doing particularly
>>   well on the testing.  What chance would we have with 2300+ more?
>> 
>>   2300+ instances of a vanishingly rare error with generally complex
>>   error handling and basically no test coverage: does that sound more
>>   useful than 2300+ instances of a vanishingly rare error that kills the
>>   process?  If yes, how much of our limited resources is the difference
>>   worth?
>> 
>> * You might argue that we don't have to handle all 2300+ instances, only
>>   the ones reachable from hotplug.  I think that's a pipe dream.  Once
>>   you permit "terminate on memory allocation failure" in general purpose
>>   code, it hides behind innocent-looking function calls.  Even if we
>>   could find them all, we'd still have to add memory allocation failure
>>   handling to lots of general purpose code just to make it usable for
>>   hot plug.  And then we'd get to keep finding them all forever.
>
> I didn't say it was easy :-)
>
>> I don't think handling all memory allocation failures gracefully
>> everywhere or even just within hotplug is practical this side of Utopia.
>> I believe all we could achieve trying is an illusion of graceful
>> handling that is sufficiently convincing to let us pat on our backs,
>> call the job done, and go back to working on stuff that matters to
>> actual users.
>
> Handling them all probably isn't; handling some well defined cases is
> probably possible.
>
> Avi's argument is 6 years old, I suggest a few things have changed in that
> time:
>   a) We now use the Error** mechanism in a lot of places - so a lot of
> code already is supposed to deal with a function call failing;  if a function
> already has an error return and the caller deals with it, then making
> the function deal with an allocation error and the caller handle it is
> a lot easier.


You still have to get the error path within the failing function right,
and tested.

>   b) The use of hotplug is now common - people really hate it when their
> nice, happy working VM dies when they try and do something to it, like
> hotplug or migrate.
>   c) I suspect (but don't know) that people are pushing the number of VMs on
> a host harder than they used to, but there again memory got cheap.
>
> I'm not that eager to protect every allocation; but in some common
> cases, where we already have Error** paths and it's relatively simple,
> then I think we should.

I won't argue we can't do more than 59.

I'm working on a patch adding FIXMEs to a few functions whose error
paths look relatively simple, but are in fact absolutely wrong.

> (OK, to be honest I think we should protect every allocation - but I do
> have sympathy with the complexity/testing arguments).
>
> Dave
>
>> My current working assumption is that passing &error_fatal to
>> memory_region_init_ram() & friends is okay even in realize() methods and
>> their supporting code, except when the allocation can be large.  Even
>> then, &error_fatal is better than buggy recovery code (which I can see
>> all over the place, but that's a separate topic).
>
>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] Error handling in realize() methods

Reply via email to