Philippe Mathieu-Daudé <phi...@redhat.com> writes:

> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>> On 07.10.2020 11:23, Thomas Huth wrote:
>>> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
>>>> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>>>>> On 10/7/20 1:07 AM, John Snow wrote:
>>>>>> I'm seeing this gitlab test fail quite often in my Python work; I
>>>>>> don't *think* this has anything to do with my patches, but maybe
>>>>>> I need to try and bisect this more aggressively.
>>> [...]
>>>>> w.r.t. the error in your build, I told Thomas about the
>>>>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>>>>> not his area. Richard has been looking yesterday to see if it is
>>>>> a TCG regression, and said the test either finished/crashed raising
>>>>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>>>>> children become zombie and the test hang.
>>>>
>>>> Expected output:
>>>>
>>>> Quiescing Open Firmware ...
>>>> Booting Linux via __start() @ 0x01000000 ...
>>>>
>>>> But QEMU exits in replay_char_write_event_load():
>>>>
>>>> Quiescing Open Firmware ...
>>>> qemu-system-ppc: Missing character write event in the replay log
>>>> $ echo $?
>>>> 1
>>>>
>>>> Latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.
>>>>
>>>> Replay file is ~22MiB. End of record using "system_powerdown + quit"
>>>> in HMP.
>>>>
>>>> I guess we have 2 bugs:
>>>> - replay log
>>>> - avocado doesn't catch children exit(1)
>>>>
>>>> Quick reproducer:
>>>>
>>>> $ make qemu-system-ppc check-venv
>>>> $ tests/venv/bin/python -m \
>>>> avocado --show=app,console,replay \
>>>> run --job-timeout 300 -t machine:mac99 \
>>>> tests/acceptance/replay_kernel.py
>>>
>>> Thanks, that was helpful. ... and the winner is:
>>>
>>> commit 55adb3c45620c31f29978f209e2a44a08d34e2da
>>> Author: John Snow <js...@redhat.com>
>>> Date: Fri Jul 24 01:23:00 2020 -0400
>>> Subject: ide: cancel pending callbacks on SRST
>>>
>>> ... starting with this commit, the tests starts failing. John, any
>>> idea what might be causing this?
>>
>> This patch includes the following lines:
>>
>> + aio_bh_schedule_oneshot(qemu_get_aio_context(),
>> + ide_bus_perform_srst, bus);
>>
>> replay_bh_schedule_oneshot_event should be used instead of this
>> function, because it synchronizes non-deterministic BHs.
>
> Why do we have 2 different functions? BH are already complex
> enough, and we need to also think about the replay API...
>
> What about the other cases such vhost-user (blk/net), virtio-blk?

This does seem like something that should be wrapped up inside
aio_bh_schedule_oneshot itself or maybe we need a
aio_bh_schedule_transaction_oneshot to distinguish it from the other
uses the function has.

--
Alex Bennée
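
To illustrate the "wrap it up inside one call" idea, here is a minimal
standalone sketch. It is a toy model, not QEMU code: every name in it
(sketch_bh_schedule_oneshot, BHQueue, replay_active, and so on) is
hypothetical, and the two queues only stand in for "ordinary BH list"
and "BHs synchronized through the replay log". The point is that the
caller uses a single entry point and the replay decision is made in one
place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are QEMU's real types. */
typedef void BHFunc(void *opaque);

enum { QUEUE_MAX = 8 };

typedef struct {
    BHFunc *cb;
    void *opaque;
} PendingBH;

typedef struct {
    PendingBH items[QUEUE_MAX];
    size_t count;
} BHQueue;

static BHQueue plain_queue;   /* models the ordinary bottom-half list */
static BHQueue replay_queue;  /* models BHs tracked by record/replay */
static bool replay_active;    /* models "record/replay is enabled" */

/* Append one callback; returns false if the toy queue is full. */
static bool queue_push(BHQueue *q, BHFunc *cb, void *opaque)
{
    assert(cb != NULL);
    if (q->count == QUEUE_MAX) {
        return false;
    }
    q->items[q->count].cb = cb;
    q->items[q->count].opaque = opaque;
    q->count++;
    return true;
}

/*
 * Single entry point: the caller never chooses between two APIs.
 * The replay check lives here instead of at every call site.
 */
static bool sketch_bh_schedule_oneshot(BHFunc *cb, void *opaque)
{
    BHQueue *q = replay_active ? &replay_queue : &plain_queue;
    return queue_push(q, cb, opaque);
}

/*
 * Run every pending callback once, then empty the queue.
 * Callbacks must not reschedule themselves in this toy model.
 */
static size_t queue_run(BHQueue *q)
{
    size_t n = q->count;
    for (size_t i = 0; i < n; i++) {
        q->items[i].cb(q->items[i].opaque);
    }
    q->count = 0;
    return n;
}

/* Example callback: increments the int that opaque points to. */
static void bump(void *opaque)
{
    ++*(int *)opaque;
}
```

With replay_active false, sketch_bh_schedule_oneshot(bump, &n) lands in
plain_queue; with it true, the same call lands in replay_queue. A
separate, explicitly named variant (as in the second suggestion above)
would instead keep two entry points and make the distinction visible at
the call site.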