On Tue, Oct 18, 2022 at 6:23 AM Peter Maydell <peter.mayd...@linaro.org> wrote:
>
> If you run this single avocado test:
>
> while ./build/x86/tests/venv/bin/avocado run
> build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
> ; do true; done
>
> then on my machine it will fail within 4 or 5 iterations. Has anybody
> looked into this? Looking at a log from a failed test
> https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log
> what seems to happen is that the test sends the "halt" command to
> the guest. The DEBUG log reports a transition from RUNNING to
> DISCONNECTING and the QMP connection disconnects, so presumably
> the guest really did exit -- or is that something the test framework
> is doing from its end?

The avocado test calls self.vm.wait(), which goes down this call
chain, all in QEMUMachine (machine.py):

wait()
.shutdown()
.._do_shutdown()
..._soft_shutdown()
...._early_cleanup()
...._close_qmp_connection()

QEMUMachine._early_cleanup, as you've noticed, closes the console
socket right away. It has this to say on the matter:

# If we keep the console socket open, we may deadlock waiting
# for QEMU to exit, while QEMU is waiting for the socket to
# become writable.

If we leave it open, QEMU is not guaranteed to exit, and the synchronous
code in machine.py isn't sophisticated enough to drain the console
socket while also waiting for QEMU to exit. This was one motivation for
switching the QMP library to asyncio, but those changes haven't reached
machine.py yet. I am hoping to have that conversion done soon, but
finishing the work of splitting out the QMP library itself took priority
in my time budget over converting more utilities to asyncio.
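
To sketch what I mean (purely illustrative -- this is not the real
machine.py or qemu.qmp code, and every name below is invented): with
asyncio we can drain the console and wait for the process concurrently,
so neither side can wedge the other.
```
import asyncio

# Illustrative only: "drain_console" and "soft_shutdown" are made-up
# names, not functions from machine.py.
async def drain_console(reader: asyncio.StreamReader, logfile) -> None:
    # Keep reading so QEMU never blocks on a full console socket buffer.
    while data := await reader.read(4096):
        logfile.write(data)

async def soft_shutdown(proc: "asyncio.subprocess.Process",
                        reader: asyncio.StreamReader, logfile) -> None:
    drain = asyncio.create_task(drain_console(reader, logfile))
    try:
        await proc.wait()  # QEMU can exit even if it was mid-write
    finally:
        drain.cancel()
```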

Next, the QMP connection itself is closed, which is that transition
from RUNNING to DISCONNECTING you see from the "protocol" module in
Avocado's log.
(Avocado is truncating the full module names here; the real source
module names are "qemu.qmp.protocol", "qemu.qmp.qmp_client", and
"qemu.machine.machine".)

>                        Anyway, the next thing that happens is
> that after a 30s timeout something (the avocado framework?) sends
> the QEMU process a SIGTERM, and there's a bunch of timeout related
> python backtraces in the log.

It looks like the subprocess wait in _soft_shutdown never returns.
Because wait() was called, we assume the caller expects the VM to exit
and has presumably already issued a "quit" or "halt" or equivalent, so
the Machine code does not issue any further commands to make that
happen.

At this point, my guess is that -- based on your later issue filings
-- the "halt" command isn't actually getting processed because of the
rapid console disconnect, so the machine.py code deadlocks waiting for
QEMU to exit when it never will.
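
For context, that part of the shutdown path is roughly shaped like this
(a loose paraphrase for illustration, not the verbatim machine.py
source; attribute names are approximate):
```
def _soft_shutdown(self, timeout):
    self._early_cleanup()            # closes the console socket right away
    if self._qmp_connection:
        self._close_qmp_connection() # the RUNNING -> DISCONNECTING you saw
    # If "halt" never reached the guest, this blocks until the timeout
    # fires (or forever, with timeout=None):
    self._subp.wait(timeout=timeout)
```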

>
> In comparison, with a pass
> https://people.linaro.org/~peter.maydell/raspi-avocado-pass.log
> the transition from DISCONNECTING to IDLE is followed within about
> 5 seconds by a test pass.
>
> So:
>
> (1) has anybody looked at what's happening in this test in
> particular?
> (2) is it possible for avocado to log all of the communication
> on the serial console? Currently the logging of what the guest
> prints cuts out early (in both success and failure cases),
> and what the test case sends to the guest is not logged at all

The truncated output logging is due to other problems, as you've noted
in your later issue filings.

I think input logging could be added by overriding ConsoleSocket.send()
in console_socket.py; see
https://docs.python.org/3/library/socket.html#socket.socket.send for the
signature, and use _drain_socket and/or recv() as a reference. You
probably don't want to write to the same logfile, to avoid an
unreadable, interleaved mess: the data buffers here aren't line-based, I
think, so interleaving input and output may not produce coherent output.

Naive and untested:
```
    def send(self, data: bytes, flags: int = 0) -> int:
        # Copy guest input to a (hypothetical) second logfile, then send.
        self._some_other_logfile.write(data)
        self._some_other_logfile.flush()
        # socket.send() returns the number of bytes actually sent.
        return super().send(data, flags)
```
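
If you want input in its own file, a slightly fuller (still
hypothetical) shape would be a subclass; "input_log" is an invented
parameter, not part of the real ConsoleSocket, and the import path is
my assumption:
```
from qemu.machine.console_socket import ConsoleSocket

class LoggingConsoleSocket(ConsoleSocket):
    """Hypothetical subclass that copies guest *input* to its own file."""
    def __init__(self, *args, input_log=None, **kwargs):
        super().__init__(*args, **kwargs)
        # A separate file keeps input from interleaving with the output log.
        self._input_log = open(input_log, "ab") if input_log else None

    def send(self, data: bytes, flags: int = 0) -> int:
        if self._input_log:
            self._input_log.write(data)
            self._input_log.flush()
        return super().send(data, flags)
```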

I am working on replacing machine.py with an async version, which should
solve some of the socket problems; but if you want a band-aid in the
meantime, I'm happy to take patches to improve what we have in the short
term. I will probably focus my own efforts on just getting the rewrite
in.

> (3) what's the best way to debug this to figure out what's
> actually happening?
>
> I tried passing timeout=None to the self.vm.wait() call in
> test_arm_raspi2_initrd(), which the comments in machine.py
> is supposed to mean "wait forever", but that didn't actually
> seem to change the timeout. (I had the idea of looking at
> the status of the QEMU process etc at that point.)

Hm ... The stack trace for the TimeoutExpired exception sure makes it
look like the Machine class is hitting its own timeout (it should be
about 30 seconds, which matches what you observed), but if that's the
case, passing timeout=None should have alleviated it -- up until you
hit the Avocado timeout, at least.

I'd expect the traceback output and failure mode to look a little
different depending on which component decides to time out.

>
> thanks
> -- PMM
>

"Thanks for the analysis, but, like, what are you going to do to fix
any of this?"

Long term, an asyncio version of machine.py will help alleviate these
kinds of race conditions when dealing with multiple sockets from
Python as our testing requirements become more complex.

Short term, I'm not sure... I think removing the early socket close code
will just cause new race-condition deadlocks elsewhere. I'd recommend
adding some kind of wait or flush to the socket code after sending
"halt" to ensure the command actually goes through. Alternatively, you
could skip sending "halt" and use vm.shutdown() directly: that issues
'quit' over QMP and always waits for a command-success acknowledgment
from QEMU before waiting on the process, which should avoid the problem
too.
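
In the test, the two options might look roughly like this (a sketch
only; exec_command and wait_for_console_pattern are the existing
avocado_qemu helpers, but the exact pattern string is just an example):
```
# Option 1: make sure "halt" was actually processed before waiting.
exec_command(self, 'halt')
wait_for_console_pattern(self, 'reboot: System halted')
self.vm.wait()

# Option 2: don't send "halt" over the console at all; shutdown()
# issues 'quit' over QMP and gets an acknowledgment before it starts
# waiting for the process to exit.
self.vm.shutdown()
```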

