Stefan Hajnoczi <stefa...@gmail.com> writes:

> This pull request causes the following CI failure:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/3328449477
>
> I haven't figured out the root cause of the failure. Maybe the pull
> request just exposes a latent failure. Please take a look and we can
> try again for -rc2.

OK, after a lot of digging I've come to the following conclusion:

  * the Fuloong 2E machine never enables the FIFO on the 16550
    (s->fcr & UART_FCR_FE)
  * as a result, if qemu_chr_fe_write(&s->chr, &s->tsr, 1) fails with
    -EAGAIN
    - a serial_watch_cb is queued
    - s->tsr_retry++
  * additional serial_ioport_write calls overwrite s->thr
  * the console output gets corrupted

You can see the effect by comparing the serial write and xmit values:

  ➜  grep serial_write alex.log | cut -d ' ' -f 6 | xxd -r -p | head -n 10
  [ 0.000000] Initializing cgroup subsys cpuset
  [ 0.000000] Initializing cgroup subsys cpu
  [ 0.000000] Initializing cgroup subsys cpuacct
  [ 0.000000] Linux version 3.16.0-6-loongson-2e (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 Debian 3.16.56-1+deb8u1 (2018-05-08)
  [ 0.000000] memsize=256, highmemsize=0
  [ 0.000000] CpuClock = 533080000
  [ 0.000000] bootconsole [early0] enabled
  [ 0.000000] CPU0 revision is: 00006302 (ICT Loongson-2)
  [ 0.000000] FPU revision is: 00000501
  [ 0.000000] Checking for the multiply/shift bug... no.
  🕙18:27:17 alex@zen:qemu.git/builds/all on pr/141122-misc-for-7.2-1 [$!?⇕]
  ➜  grep serial_xmit alex.log | cut -d ' ' -f 2 | xxd -r -p | head -n 10
  [ 0.000000] Initializing cgroup subsys cpuset
  [ 0.000000] Initializing cgroup subsys cpu
  [ 0.000000] Initializing cgroup subsys cpuacct
  [ 0.000000] Linux version 3.16.0-6-loongson-2e (debian-ker...@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 Debian 33 0.000000] bootconsole [early0] enabled
  [ 0.000000] CPU0 revision is: 00006302 (ICT Loongson-2)
  [ 0.000000] FPU revision is: 00000501
  [ 0.000000] Checking for the multiply/shift bug... no.
  [ 0.000000] Checking for the daddiu bug... no.
  [ 0.000000] Determined physical RAM map:
  [ 0.000000] memory: 000

As a result the check for the pattern fails:

        console_pattern = 'Kernel command line: %s' % kernel_command_line
        self.wait_for_console_pattern(console_pattern)

resulting in a timeout and a test failure.

In effect the configuration makes the output dependent on how fast the
avocado test can drain the socket, as there is no buffering elsewhere
in the system. The changes in:

  Subject: [PULL 02/10] tests/avocado: improve behaviour waiting for login prompts

make this failure more likely to happen, I think because .peek() and
.readline() have different buffering strategies.

Options include:

  - enable the 16550 FIFO for the Loongson kernel (command line option?)
  - increase the buffering of the python socket.socket() code

I can get it to pass by shuffling the time.sleep() and a few other
checks around but that seems flaky at best.

-- 
Alex Bennée
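
For anyone who wants to see the failure mode in isolation, here is a
minimal sketch of the sequence described above. It is a toy model, not
the QEMU serial code: only the names thr, tsr and tsr_retry come from
the analysis, while ToyUart, ToyBackend, watch_cb and demo are
hypothetical names made up for illustration. It assumes the guest
writes bytes without waiting for the transmitter to drain, and that a
failed backend write stands in for -EAGAIN.

```python
class ToyBackend:
    """Hypothetical stand-in for the chardev: rejects writes while not ready."""

    def __init__(self):
        self.ready = True
        self.received = bytearray()

    def write(self, byte):
        if not self.ready:
            return False  # stands in for qemu_chr_fe_write() returning -EAGAIN
        self.received.append(byte)
        return True


class ToyUart:
    """Toy FIFO-less UART: one holding register (thr), one shift register (tsr)."""

    def __init__(self, backend_write):
        self.backend_write = backend_write
        self.thr = None      # byte most recently written by the guest
        self.tsr = None      # byte currently being transmitted
        self.tsr_retry = 0   # non-zero while a failed write is pending

    def guest_write(self, byte):
        # With no FIFO, a new write replaces whatever was waiting in thr.
        self.thr = byte
        if self.tsr_retry == 0:
            self.xmit()

    def watch_cb(self):
        # Assumed to run once the backend can accept data again.
        self.xmit()

    def xmit(self):
        while True:
            if self.tsr_retry == 0:
                if self.thr is None:
                    return
                self.tsr, self.thr = self.thr, None
            if not self.backend_write(self.tsr):
                self.tsr_retry += 1
                return
            self.tsr_retry = 0


def demo():
    backend = ToyBackend()
    uart = ToyUart(backend.write)
    written = b"Debian 3.16"
    for i, byte in enumerate(written):
        if i == 3:
            backend.ready = False   # backend stalls before the 4th byte
        if i == 8:
            backend.ready = True    # backend recovers, pending byte is retried
            uart.watch_cb()
        uart.guest_write(byte)
    return written, bytes(backend.received)
```

Calling demo() returns (b"Debian 3.16", b"Debi3.16"): the byte held in
tsr survives the stall, but every byte written to thr while the retry
was pending is overwritten by the next one, which is the same shape of
corruption as in the serial_xmit output above.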