virt: Fix error handling in cpu unplug

bibo mao Sat, 22 Mar 2025 03:44:17 -0700



On 2025/3/21 下午4:08, Markus Armbruster wrote:

bibo mao <maob...@loongson.cn> writes:

On 2025/3/21 下午3:21, Markus Armbruster wrote:

bibo mao <maob...@loongson.cn> writes:

+Igor


On 2025/3/21 下午2:47, Markus Armbruster wrote:

Bibo Mao <maob...@loongson.cn> writes:

In function virt_cpu_unplug(), it will send cpu unplug message to
interrupt controller extioi and ipi irqchip. If there is problem in
this function, system should continue to run and keep state the same
before cpu is removed.

If error happends in cpu unplug stage, send cpu plug message to extioi
and ipi irqchip to restore to previous stage, and then return immediately.

Fixes: 2cd6857f6f5b (hw/loongarch/virt: Implement cpu unplug interface)
Signed-off-by: Bibo Mao <maob...@loongson.cn>
---
    hw/loongarch/virt.c | 6 ++++++
    1 file changed, 6 insertions(+)

diff --git a/hw/loongarch/virt.c b/hw/loongarch/virt.c
index 8563967c8b..503362a69e 100644
--- a/hw/loongarch/virt.c
+++ b/hw/loongarch/virt.c
@@ -958,6 +958,8 @@ static void virt_cpu_unplug(HotplugHandler *hotplug_dev,
        hotplug_handler_unplug(HOTPLUG_HANDLER(lvms->extioi), dev, &err);
        if (err) {
            error_propagate(errp, err);
+        hotplug_handler_plug(HOTPLUG_HANDLER(lvms->ipi), dev,
+                             &error_abort);
            return;
        }

@@ -965,6 +967,10 @@ static void virt_cpu_unplug(HotplugHandler *hotplug_dev,

        hotplug_handler_unplug(HOTPLUG_HANDLER(lvms->acpi_ged), dev, &err);
        if (err) {
            error_propagate(errp, err);
+        hotplug_handler_plug(HOTPLUG_HANDLER(lvms->ipi), dev,
+                             &error_abort);
+        hotplug_handler_plug(HOTPLUG_HANDLER(lvms->extioi), dev,
+                             &error_abort);
            return;
        }


virt_cpu_unplug() calls hotplug_handler_unplug() three times to notify
ipi, extioi, and acpi_get.  If any notification fails, virt_cpu_unplug()
calls hotplug_handler_plug() to "un-notify" the preceeding ones, if any.
This must not fail.

virt_cpu_plug() does it the other way round (see previous patch).

So, hotplug_handler_plug() must not fail in virt_cpu_unplug(), yet we
check for it to fail in virt_cpu_plug().

Can it really fail in virt_cpu_plug()?

If yes, why can't it fail in virt_cpu_unplug()?

you can check function acpi_cpu_plug_cb()/loongarch_ipi_cpu_plug(), that
is cpuplug callback for acpi_ged and ipi. it will not fail.

If *virt_cpu_pre_plug()* pass, it will succeed.

Regards
Bibo Mao


Same questions for hotplug_handler_unplug().


Let me restate my argument.

We call hotplug_handler_plug() on the happy path, and on error recovery
paths.  Four cases:

1. Can fail on the happy path

     Error recovery is required.

1.1 Can fail on the error recovery path

      Error recovery is required, but broken.

1.2 Can't fail on the error recovery path

      Error recovery is required and works, but why it works is not
      obvious.  Deserves a comment explaining why hotplug_handler_plug()
      can't fail here even though it can fail on the happy path next door.

2. Can't fail on the happy path

     Error recovery is unreachable.

2.1 Can fail on the error recovery path

      Error recovery is unreachable and broken.  Possibly a time bomb, and
      possibly misleading readers.

2.2 Can't fail on the error recovery path

      Error recovery is unreachable and would work, but why it would work
      is again a not obvious.

Which of the four cases is it?

By my understanding, it is "2. Can't fail on the happy path",  and Error
recovery is unreachable.


Got it.

I have said that it is impossible and recovery is only for future use.

do you mean recovery should be removed? And directly &error_abort is
used in virt_cpu_plug() such as:
static void virt_cpu_plug(HotplugHandler *hotplug_dev,
                            DeviceState *dev, Error **errp)
{
    if (lvms->ipi) {
      hotplug_handler_plug(HOTPLUG_HANDLER(lvms->ipi), dev, &error_abort);


Yes, I prefer this.  Here's why.

This is ok, however I do not think there is obvious advantage. V6 is okalso, there is recovery path only that recovery has problem, system willcrash, through it is impossible now.

Maybe someone jumps out and say that there is error handling in cpuhotplug, however no such in pc_memory_plug(), why does not LoongArch insuch way. If so, there will endless discussion.


void x86_cpu_plug(HotplugHandler *hotplug_dev,
                  DeviceState *dev, Error **errp)
{
    CPUArchId *found_cpu;
    Error *local_err = NULL;
    X86CPU *cpu = X86_CPU(dev);
    X86MachineState *x86ms = X86_MACHINE(hotplug_dev);

    if (x86ms->acpi_dev) {
        hotplug_handler_plug(x86ms->acpi_dev, dev, &local_err);

static void pc_memory_plug(HotplugHandler *hotplug_dev,
                           DeviceState *dev, Error **errp)
{
    PCMachineState *pcms = PC_MACHINE(hotplug_dev);
    X86MachineState *x86ms = X86_MACHINE(hotplug_dev);
    MachineState *ms = MACHINE(hotplug_dev);
    bool is_nvdimm = object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM);

    pc_dimm_plug(PC_DIMM(dev), MACHINE(pcms));

    if (is_nvdimm) {
        nvdimm_plug(ms->nvdimms_state);
    }

    hotplug_handler_plug(x86ms->acpi_dev, dev, &error_abort);
}

Regards
Bibo Mao


Error recovery that is unreachable now but might become reachable at
some future time is untestable now.  Mind, this does not necessarily
make it a bad idea by itself.  But there's more.

Anything that makes this error recovery reachable either breaks it or
makes correctness locally unobvious.  Why?  To make it reachable, plug /
unplug must be able to fail on the happy path.  But then they either can
fail on the error recovery path as well (which breaks error recovery),
or they can't fail there for reasons that are not locally obvious.

This sets a trap for readers.  An attentive reader will see the problem
(like I did), but to see why the code is not broken right now will take
digging (like we did together).  And after such digging, we're left with
a queasy feeling about robustness of the code (like we are now).

Passing &error_abort on the happy path avoids all this.  Instead it
clearly tells the reader that this is not expected to fail.

If failure becomes possible at some future time, this should crash in
testing.  If we neglect to test the new failure (and we really
shouldn't), we crash on error in production right away instead of
risking botched error recovery messing up the program's state.
Both are bad outcomes, but which one's less bad I find impossible to
predict.

Re: [PATCH v6 3/6] hw/loongarch/virt: Fix error handling in cpu unplug

Reply via email to