* Markus Armbruster (arm...@redhat.com) wrote: > In general, code running withing a realize() method should not exit() on > error. Instad, errors should be propagated through the realize() > method. Additionally, the realize() method should fail cleanly, > i.e. carefully undo its side effects such as wiring of interrupts, > mapping of memory, and so forth. Tedious work, but necessary to make > hot plug safe. > > Quite a few devices don't do that. > > Some of them can be usefully hot-plugged, and for them unclean failures > are simply bugs. I'm going to mark the ones I can find. > > Others are used only as onboard devices, and if their realize() fails, > the machine's init() will exit()[*]. In an ideal world, we'd start with > an empty board and cold-plugg devices, and there, clean failure may be > useful. In the world we live in, making these devices fail cleanly is a > lot of tedious work for no immediate gain. > > Example: machine "kzm" and device "fsl,imx31". fsl_imx31_realize() > returns without cleanup on error. kzm_init() exit(1)s when realize > fails, so the lack of cleanup is a non-issue. > > I think this is basically okay for now, but I'd like us to mark these > devices cannot_instantiate_with_device_add_yet, with /* Reason: > realize() method fails uncleanly */. > > Opinions? > > Next, let's consider the special case "out of memory". > > Our general approach is to treat it as immediately fatal. This makes > sense, because when a smallish allocation fails, the process is almost > certainly doomed anyway. Moreover, memory allocation is frequent, and > attempting to recover from failed memory allocation adds loads of > hard-to-test error paths. These are *dangerous* unless carefully tested > (and we don't). > > Certain important allocations we handle more gracefully. For instance, > we don't want to die when the user tries to hot-plug more memory than we > can allocate, or tries to open a QCOW2 image with a huge L1 table. > > Guest memory allocation used to have the "immediately fatal" policy > baked in at a fairly low level, but it's since been lifted into callers; > see commit c261d77..fc7a580 and fixups 4f96676..0bdaa3a. During review > of the latter, Peter Crosthwaite called out the &error_fatal in the > realize methods and their supporting code. I agreed with him back then > that the errors should really be propagated. But now I've changed my > mind: I think we should treat these memory allocation failures like > we've always treated them, namely report and exit(1). Except for > "large" allocations, where we have a higher probability of failure, and > a more realistic chance to recover safely. > > Can we agree that passing &error_fatal to memory_region_init_ram() & > friends is basically okay even in realize() methods and their supporting > code?
I'd say it depends if they can be hotplugged; I think anything that we really want to hotplug onto real running machines (as opposed to where we're just hotplugging during setup) we should propagate errors properly. And tbh I don't buy the small allocation argument; I think we should handle them all; in my utopian world a guest wouldn't die unless there was no way out. Dave > > [*] Well, the ones that bother to check for errors, but that's a > separate problem. > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK