Re: [PATCH v4 2/3] module: add Error arguments to module_load_one and module_load_qom_one

Claudio Fontana Fri, 16 Sep 2022 08:07:13 -0700

On 9/16/22 16:26, Markus Armbruster wrote:
> Claudio Fontana <cfont...@suse.de> writes:
> 
>> On 9/16/22 11:27, Markus Armbruster wrote:
>>> Claudio Fontana <cfont...@suse.de> writes:
>>>
>>>> improve error handling during module load, by changing:
>>>>
>>>> bool module_load_one(const char *prefix, const char *lib_name);
>>>> void module_load_qom_one(const char *type);
>>>>
>>>> to:
>>>>
>>>> bool module_load_one(const char *prefix, const char *name, Error **errp);
>>>> bool module_load_qom_one(const char *type, Error **errp);
>>>>
>>>> module_load_qom_one has been introduced in:
>>>>
>>>> commit 28457744c345 ("module: qom module support"), which built on top of
>>>> module_load_one, but discarded the bool return value. Restore it.
>>>>
>>>> Adapt all callers to emit errors, or ignore them, or fail hard,
>>>> as appropriate in each context.
>>>
>>> How exactly does behavior change?  The commit message is mum on the
>>> behavior before the patch, and vague on the behavior afterwards.
>>
>> The current behavior is difficult to describe as it is basically devoid of 
>> any error handling for QOM modules.
>> What error handling facilities were potentially available before were 
>> rendered moot by commit 28457744c345.
> 
> *Sigh*
> 
>> the behavior changes in this way:
>>
>> audio: when attempting to load an audio module, now actually report module 
>> load errors.
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution.
>>
>> block: when attempting to load a block module, now actually report module 
>> load errors.
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution.
>>
>> block/dmg: specifically for the dmg module, which contains submodules 
>> dmg-bz2 and dmg-lzfse, now also warn that if the submodules are not present, 
>> the corresponding decompression will not be available.
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution.
>>
>> qdev: when creating a new qdev Device object (DeviceState), actually report 
>> module load errors. If a module cannot be loaded to create that device, 
>> abort execution as intended.
>> QEMU users of qdev assume that qdev_new() returns non-NULL. There is a 
>> separate qdev_try_new() function whose only difference from qdev_new() is 
>> that it can return NULL rather than asserting.
>> (see include/hw/qdev-core.h :: qdev_try_new)
>>
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution (likely, just segfault).
>>
>> qom/object.c : when initializing a QOM object, actually report module load 
>> errors.
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution.
>>
>> qom/object.c : when looking up an ObjectClass by name 
>> (module_object_class_by_name), actually report module load errors.
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution.
>>
>> qtest: when processing the "module_load" qtest command, if there is an 
>> actual error in the load of the module, report it in addition to sending a 
>> "FAIL" response.
>>
>> console: when attempting to load a display module, now actually report 
>> module load errors.
>> Previous behavior: say nothing, and leave it to the user to figure out what 
>> is wrong by failures later on during execution.
>>
>> util/module: provide a boolean return value and an Error parameter to 
>> module_load_qom_one and provide an Error parameter to module_load_one what 
>> was missing it.
>> Report an error using the Error facilities when module is present, but fails 
>> to be loaded due to an error occurring.
>> Also report an error if multiple modules are loadable and try to provide the 
>> same type.
>>
>> Previous behavior:
>> difficult to describe, a really confused implementation.
>>
>> The module_load_file part of the code seemed to assume being called 
>> separately, and was checking the caller arguments as though it would be a 
>> self-contained API,
>> rechecking the suffix of the string for a valid ".so" (CONFIG_HOST_DSOSUF), 
>> when it is called that way explicitly and only by module_load_one as a 
>> static function.
>>
>> It was printing an error with fprintf specifically only in the case where 
>> g_module_open failed, and never for any other error condition.
>> It only printed this error if the boolean "mayfail" was set to false, but 
>> the argument was always passed as false, in all execution paths.
>> It was not reporting any error if multiple modules are present all providing 
>> the same type.
> 
> What a mess.
> 
>> Let me know if you have further questions, what I described is a summary of 
>> all the changes contained. I can add to the commit message if necessary.
> 
> Yes, please.


Will do.

> 
>>>> Signed-off-by: Claudio Fontana <cfont...@suse.de>
>>>> ---
>>>>  audio/audio.c         |   9 ++-
>>>>  block.c               |  15 ++++-
>>>>  block/dmg.c           |  18 +++++-
>>>>  hw/core/qdev.c        |  10 ++-
>>>>  include/qemu/module.h |  38 ++++++++++--
>>>>  qom/object.c          |  18 +++++-
>>>>  softmmu/qtest.c       |   6 +-
>>>>  ui/console.c          |  18 +++++-
>>>>  util/module.c         | 140 ++++++++++++++++++++++++------------------
>>>>  9 files changed, 194 insertions(+), 78 deletions(-)
>>>>
>>>> diff --git a/audio/audio.c b/audio/audio.c
>>>> index 76b8735b44..cff7464c07 100644
>>>> --- a/audio/audio.c
>>>> +++ b/audio/audio.c
>>>> @@ -72,6 +72,7 @@ void audio_driver_register(audio_driver *drv)
>>>>  audio_driver *audio_driver_lookup(const char *name)
>>>>  {
>>>>      struct audio_driver *d;
>>>> +    Error *local_err = NULL;
>>>>  
>>>>      QLIST_FOREACH(d, &audio_drivers, next) {
>>>>          if (strcmp(name, d->name) == 0) {
>>>> @@ -79,7 +80,13 @@ audio_driver *audio_driver_lookup(const char *name)
>>>>          }
>>>>      }
>>>>  
>>>> -    audio_module_load_one(name);
>>>> +    if (!audio_module_load_one(name, &local_err)) {
>>>> +        if (local_err) {
>>>> +            error_report_err(local_err);
>>>> +        }
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>>      QLIST_FOREACH(d, &audio_drivers, next) {
>>>>          if (strcmp(name, d->name) == 0) {
>>>>              return d;
>>>> diff --git a/block.c b/block.c
>>>> index bc85f46eed..8b610c6d95 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -464,7 +464,14 @@ BlockDriver *bdrv_find_format(const char *format_name)
>>>>      /* The driver isn't registered, maybe we need to load a module */
>>>>      for (i = 0; i < (int)ARRAY_SIZE(block_driver_modules); ++i) {
>>>>          if (!strcmp(block_driver_modules[i].format_name, format_name)) {
>>>> -            block_module_load_one(block_driver_modules[i].library_name);
>>>> +            Error *local_err = NULL;
>>>> +            if 
>>>> (!block_module_load_one(block_driver_modules[i].library_name,
>>>> +                                       &local_err)) {
>>>> +                if (local_err) {
>>>> +                    error_report_err(local_err);
>>>> +                }
>>>> +                return NULL;
>>>> +            }
>>>>              break;
>>>>          }
>>>>      }
>>>
>>> Before the patch, bdrv_find_format() fails silently[*].
>>>
>>> Afterwards, it reports an error on some failures, but not on others.
>>> Sure this is what we want?
>>
>> Yes, I am sure. It only reports an error if an error exists.
>> When a module is not present, no error should be printed in this context.
> 
> Callers commonly look like this (this one is in bdrv_fill_options()):
> 
>         drv = bdrv_find_format(drvname);
>         if (!drv) {
>             error_setg(errp, "Unknown driver '%s'", drvname);
>             return -ENOENT;
>         }
> 
> where @errp is a parameter.
> 
> The patch changes bdrv_find_format() to report some failures with
> error_report_err().  This is effectively the same anti-pattern I pointed
> out below, starting with "Uh-oh": use of error_report() or equivalent
> within a function with an Error ** parameter.
> 
> If the "Unknown driver" Error object is ultimately passed to
> error_report_err(), we report two errors if module loading fails, else
> one.
> 
> Emitting two error messages for the same problem, a useful one and a
> useless one, is bad UI, but it's arguably still better than just a
> useless one.

I'll see if it makes sense to merge them.
I wanted to avoid removing the existing Unknown driver / or Unknown Protocol 
errors
which people might be relying on.

> 
> But what if a caller wants to handle errors without reporting them?
> This is a legitimate use of the Error API.  No go, we spam stderr
> regardless.

I suppose the Error API can be directed to something else than stderr?

> 
> What if a caller wants to report the error in some other way, say write
> it to a log file?  No go, we still spam stderr.
> 
> The patch breaks on of the Error API's design maxims.  error.h's big
> comment:
> 
>  * - Separation of concerns: the function is responsible for detecting
>  *   errors and failing cleanly; handling the error is its caller's
>  *   job.  [...]
> 
> The patch changes bdrv_find_options() to no longer satisfy this rule: it
> doesn't leave handling entirely to the caller anymore.
> 
> You might argue that no existing caller does such things.  Fine.  Then
> I'll ask you to document that bdrv_fill_options() may only be used in
> certain ways, unlike other functions using the Error API.  Same for all
> the other affected functions.  And then Kevin will probably NAK the
> whole thing :)
> 
>>>> @@ -976,7 +983,11 @@ BlockDriver *bdrv_find_protocol(const char *filename,
>>>>      for (i = 0; i < (int)ARRAY_SIZE(block_driver_modules); ++i) {
>>>>          if (block_driver_modules[i].protocol_name &&
>>>>              !strcmp(block_driver_modules[i].protocol_name, protocol)) {
>>>> -            block_module_load_one(block_driver_modules[i].library_name);
>>>> +            Error *local_err = NULL;
>>>> +            if 
>>>> (!block_module_load_one(block_driver_modules[i].library_name,
>>>> +                                       &local_err) && local_err) {
>>>
>>> Break the line before && local_err, please, like you do elsewhere.
>>
>> This is to avoid hitting the 80 chars on one line, as warned by checkpatch. 
>> Should I disregard checkpatch, is it not used anymore?
> 
> Elsewhere, you format like this:
> 
>                  if 
> (!block_module_load_one(block_driver_modules[i].library_name,
>                                             &local_err)
>                      && local_err) {
> 
> Better, isn't it?

I would not say so, I have no preference, but I can change it.

> 
>>
>>>
>>>> +                error_report_err(local_err);
>>>> +            }
>>>>              break;
>>>>          }
>>>>      }
>>>
>>> Uh-oh: error_report() or equivalent in a function with an Error **
>>> parameter.  This is almost always wrong.  Shouldn't you pass the error
>>> to the caller?
>>
>> I guess this is the "almost" case. No I should not pass anything back.
>>
>> The Error **errp parameter of bdrv_find_protocol, and the Error local_err 
>> parameter inside it are different errors.
>>
>> The first is about whether a protocol has been found or not.
>> The second is about whether there was an error during the load of a module.
>>
>> No, local_err should not be passed back to the caller in my view. The error 
>> passed back to the caller is errp and is already set correctly.
> 
> Part of bdrv_find_protocol()'s (unstated) contract is "either succeed
> and return non-null, or fail, return null, and set an error."
> 
> This is pretty much ubiquitous among Error-using functions.  error.h's
> big comment:
> 
>  * - On success, the function should not touch *errp.  On failure, it
>  *   should set a new error, e.g. with error_setg(errp, ...), or
>  *   propagate an existing one, e.g. with error_propagate(errp, ...).
> 
> Note there is no third case "except on some failures, don't set an
> error".
> 
> Of course, bdrv_find_protocol() might be an exception, i.e. I still owe
> you proof of my claim about its contract.  So let's examine callers:
> 
> * print_block_option_help() in qemu-img.c:
> 
>             proto_drv = bdrv_find_protocol(filename, true, &local_err);
>             if (!proto_drv) {
>                 error_report_err(local_err);
>                 qemu_opts_free(create_opts);
>                 return 1;
>             }
> 
>   If bdrv_find_protocol() returns null without setting @local_err,
>   error_report_err() will crash.

I don't see how this patch changes anything here, or why it relates to this 
code at all.

No, this patch does not change bdrv_find_protocol() unwritten claim about the 
Error parameter.

I think you should re-read the code honestly.

> 
>   Two more instances elsewhere in qemu-img.c
> 
> * bdrv_create_file() in block.c:
> 
>         drv = bdrv_find_protocol(filename, true, errp);
>         if (drv == NULL) {
>             return -ENOENT;
>         }
> 
>   If bdrv_find_protocol() returns null without setting @local_err, then
>   bdrv_create_file() returns -ENOENT without setting an error.

Again, I do not see how this is an argument pertinent to this patch.

> 
>   bdrv_create_file() is called by a bunch of BlockDriver
>   .bdrv_co_create_opts() methods, e.g. raw_co_create_opts():
> 
>         return bdrv_create_file(filename, opts, errp);
> 
>   The method is called by bdrv_create_co_entry():
> 
>         ret = cco->drv->bdrv_co_create_opts(cco->drv,
>                                             cco->filename, cco->opts, 
> &local_err);
>         error_propagate(&cco->err, local_err);
>         cco->ret = ret;
> 
>        If bdrv_find_protocol() returns null without setting an error, then
>        we set cco->ret = -ENOENT and cco->err = NULL.
> 
>        One of its users is bdrv_create():
> 
>         if (qemu_in_coroutine()) {
>             /* Fast-path if already in coroutine context */
>             bdrv_create_co_entry(&cco);
>         } else {
>             co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
>             qemu_coroutine_enter(co);
>             while (cco.ret == NOT_DONE) {
>                 aio_poll(qemu_get_aio_context(), true);
>             }
>         }
> 
>         ret = cco.ret;
>         if (ret < 0) {
>             if (cco.err) {
>                 error_propagate(errp, cco.err);
>             } else {
>                 error_setg_errno(errp, -ret, "Could not create image");
>             }
>         }
> 
>     out:
>         g_free(cco.filename);
>         return ret;
> 
>   If bdrv_find_protocol() returns null without setting an error, then
>   bdrv_create() returns -ENOENT without setting an error.
> 
>   One of its callers is bdrv_append_temp_snapshot():
> 
>     ret = bdrv_create(&bdrv_qcow2, tmp_filename, opts, errp);
>     qemu_opts_del(opts);
>     if (ret < 0) {
>         error_prepend(errp, "Could not create temporary overlay '%s': ",
>                       tmp_filename);
>         goto out;
>     }
> 
>   Crash.
> 
> Adding failure modes that use @errp differently than existing ones is
> playing with fire :)


Except this patch has nothing to do with the material presented here.

> 
>>> Please check all uses of your FOO_load_one() for this issue.

Will do again, but to me it looks unrelated.

>>
>> I have checked all instances multiple times. If there are issues please 
>> present them so I can address them.
> 
> I humbly suggest that instances in functions that set an error on
> failures before this patch need re-checking.

Will go over it again.

> 
>>>> diff --git a/block/dmg.c b/block/dmg.c
>>>> index 98db18d82a..11d184d39c 100644
>>>> --- a/block/dmg.c
>>>> +++ b/block/dmg.c
>>>> @@ -434,6 +434,7 @@ static int dmg_open(BlockDriverState *bs, QDict 
>>>> *options, int flags,
>>>>      uint64_t plist_xml_offset, plist_xml_length;
>>>>      int64_t offset;
>>>>      int ret;
>>>> +    Error *local_err = NULL;
>>>>  
>>>>      ret = bdrv_apply_auto_read_only(bs, NULL, errp);
>>>>      if (ret < 0) {
>>>> @@ -446,8 +447,21 @@ static int dmg_open(BlockDriverState *bs, QDict 
>>>> *options, int flags,
>>>>          return -EINVAL;
>>>>      }
>>>>  
>>>> -    block_module_load_one("dmg-bz2");
>>>> -    block_module_load_one("dmg-lzfse");
>>>> +    if (!block_module_load_one("dmg-bz2", &local_err)) {
>>>> +        if (local_err) {
>>>> +            error_report_err(local_err);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +        warn_report("dmg-bz2 module not present, bz2 decomp unavailable");
>>>> +    }
>>>> +    local_err = NULL;
>>>> +    if (!block_module_load_one("dmg-lzfse", &local_err)) {
>>>> +        if (local_err) {
>>>> +            error_report_err(local_err);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +        warn_report("dmg-lzfse module not present, lzfse decomp 
>>>> unavailable");
>>>> +    }
>>>>  
>>>>      s->n_chunks = 0;
>>>>      s->offsets = s->lengths = s->sectors = s->sectorcounts = NULL;
>>>> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
>>>> index 0806d8fcaa..5902c59c94 100644
>>>> --- a/hw/core/qdev.c
>>>> +++ b/hw/core/qdev.c
>>>> @@ -148,7 +148,15 @@ bool qdev_set_parent_bus(DeviceState *dev, BusState 
>>>> *bus, Error **errp)
>>>>  DeviceState *qdev_new(const char *name)
>>>>  {
>>>>      if (!object_class_by_name(name)) {
>>>> -        module_load_qom_one(name);
>>>> +        Error *local_err = NULL;
>>>> +        if (!module_load_qom_one(name, &local_err)) {
>>>> +            if (local_err) {
>>>> +                error_report_err(local_err);
>>>> +            } else {
>>>> +                error_report("could not find a module for type '%s'", 
>>>> name);
>>>> +            }
>>>> +            abort();
>>>
>>> Why is this a programming error?
>>
>> As described before in your request for previous/after, this is a 
>> programming error because qdev_new is not expected to fail.
>> All QEMU users of qdev_new expect the API to return non-NULL and immediately 
>> use and dereference the pointer returned.
>> There is a separate API, qdev_try_new, for creating a device and allowing a 
>> NULL return value.
> 
> Well, then the programming error is "adding loadable device modules
> broke qdev_new()".
> 
> This is qdev_new() before commit 7ab6e7fcce "qdev: device module
> support":
> 
>     DeviceState *qdev_new(const char *name)
>     {
>         return DEVICE(object_new(name));
>     }
> 
> Precondition: @name is a valid device type name.  If it isn't, we fail
> an assertion.
> 
> If you got a @name that may not be valid, you must use qdev_try_new()
> instead:
> 
>     DeviceState *qdev_try_new(const char *name)
>     {
>         if (!object_class_by_name(name)) {
>             return NULL;
>         }
> 
>         return DEVICE(object_new(name));
>     }
> 
> The guard ensures we call object_new() only when the precondition holds.
> 
> Except it doesn't: when @name is a valid type name, but not a *device*
> type name, we still crash.  Bug.
> 
> Commit 7ab6e7fcce changed qdev_new() to
> 
>     DeviceState *qdev_new(const char *name)
>     {
>         if (!object_class_by_name(name)) {
>             module_load_qom_one(name);
>         }
>         return DEVICE(object_new(name));
>     }
> 
> Now the (enforced!) precondition is @name is a valid device type name
> and the device is either compiled in or it will succeed to load.
> 
> "Will succeed to load" is an inappropriate precondition.
> 
> Devices compiled as modules must be instantiated with qdev_try_new().
> Loadable module support should be dropped from qdev_new().

This should be a separate series / patch in my view, with bigger impact on the 
whole code.

> 
> Not your patch's fault, of course.  By the way, welcome to the swamp!
> Do not mind the crocodiles, they just want to play.
> 
>>>> +        }
>>>>      }
>>>>      return DEVICE(object_new(name));
>>>>  }
>>>> diff --git a/include/qemu/module.h b/include/qemu/module.h
>>>> index 8c012bbe03..78d4c4de96 100644
>>>> --- a/include/qemu/module.h
>>>> +++ b/include/qemu/module.h
>>>> @@ -61,16 +61,44 @@ typedef enum {
>>>>  #define fuzz_target_init(function) module_init(function, \
>>>>                                                 MODULE_INIT_FUZZ_TARGET)
>>>>  #define migration_init(function) module_init(function, 
>>>> MODULE_INIT_MIGRATION)
>>>> -#define block_module_load_one(lib) module_load_one("block-", lib)
>>>> -#define ui_module_load_one(lib) module_load_one("ui-", lib)
>>>> -#define audio_module_load_one(lib) module_load_one("audio-", lib)
>>>> +#define block_module_load_one(lib, errp) module_load_one("block-", lib, 
>>>> errp)
>>>> +#define ui_module_load_one(lib, errp) module_load_one("ui-", lib, errp)
>>>> +#define audio_module_load_one(lib, errp) module_load_one("audio-", lib, 
>>>> errp)
>>>>  
>>>>  void register_module_init(void (*fn)(void), module_init_type type);
>>>>  void register_dso_module_init(void (*fn)(void), module_init_type type);
>>>>  
>>>>  void module_call_init(module_init_type type);
>>>> -bool module_load_one(const char *prefix, const char *lib_name);
>>>> -void module_load_qom_one(const char *type);
>>>> +
>>>> +/*
>>>> + * module_load_one: attempt to load a module from a set of directories
>>>> + *
>>>> + * directories searched are:
>>>> + * - getenv("QEMU_MODULE_DIR")
>>>> + * - get_relocated_path(CONFIG_QEMU_MODDIR);
>>>> + * - /var/run/qemu/${version_dir}
>>>> + *
>>>> + * prefix:         a subsystem prefix, or the empty string ("audio-", 
>>>> ..., "")
>>>> + * name:           name of the module
>>>> + * errp:           error to set in case the module is found, but load 
>>>> failed.
>>>> + *
>>>> + * Return value:   true on success (found and loaded);
>>>> + *                 if module if found, but load failed, errp will be set.
>>>> + *                 if module is not found, errp will not be set.
>>>
>>> I understand you need to distingush two failure modes "found, but load
>>> failed" and "not found".
>>
>> That is correct.
>>
>>>
>>> Functions that set an error on some failures only tend to be awkward: in
>>> addition to checking the return value for failure, you have to check
>>> @errp for special failures.  This is particularly cumbersome when it
>>> requires a @local_err and an error_propagate() just for that.  I
>>> generally prefer to return an error code and always set an error.
>>>
>>> That said, the patch doesn't look bad.  Perhaps it will be once the
>>> issues I pointed out above have been addressed.  Let's not worry about
>>> it right now.
>>>
>>>> + */
>>>> +bool module_load_one(const char *prefix, const char *name, Error **errp);
>>>> +
>>>> +/*
>>>> + * module_load_qom_one: attempt to load a module to provide a QOM type
>>>> + *
>>>> + * type:           the type to be provided
>>>> + * errp:           error to set.
>>>> + *
>>>> + * Return value:   true on success (found and loaded), false on failure.
>>>> + *                 If a module is simply not found for the type,
>>>> + *                 errp will not be set.
>>>> + */
>>>> +bool module_load_qom_one(const char *type, Error **errp);
>>>>  void module_load_qom_all(void);
>>>>  void module_allow_arch(const char *arch);
>>>>  
>>>
>>> [...]
>>>
>>>
>>> [*] Except it prints "Module is not supported by system\n" to stderr
>>> when !g_module_supported(), which doesn't look right to me.
>>>
>> I don't know what this comment relates to.
> 
> It's a footnote attached to "Before the patch, bdrv_find_format() fails
> silently[*]" above.
> 
> 
> I think the root of the problems with this patch is that you convert the
> FOO_module_load_one() functions to the Error API without propagating the
> error far enough.  It really needs to go all the way up.
> 

I respecfully disagree. Sometimes I think we need to be practical; there is for 
sure further improvement possible,
a redesign of the Error API to be more consistent with return values,
rethinking about use of qdev_new vs try_qdev_new() etc,

but in my view the series is complete and self-consistent
(and btw fixes an actual bug in your CentOS docker / kubevirt).

I'll review all the changes and see if merging some errors make sense in some 
cases, but generally I would recommend against it.

Further improvements can and should be done but in my view should not be part 
of this series,
I cannot honestly take at the moment that kind of work item.

Thanks,

Claudio

Re: [PATCH v4 2/3] module: add Error arguments to module_load_one and module_load_qom_one

Reply via email to