"Nicholas Piggin" <npig...@gmail.com> writes: > On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote: >> Nicholas Piggin <npig...@gmail.com> writes: >> >> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2. >> > >> > That commit causes reverse_debugging.py test failures, and does >> > not seem to solve the root cause of the problem x86-64 still >> > hangs in record/replay tests. >> >> I'm still finding the reverse debugging tests failing with this series. > > :( > > In gitlab CI or your own testing? What are you running exactly?
My own - my mistake I didn't get a clean build because of the format bug. However I'm seeing new failures: env QEMU_TEST_FLAKY_TESTS=1 AVOCADO_TIMEOUT_EXPECTED=1 ./pyvenv/bin/avocado run ./tests/avocado/reverse_debugging.py Fetching asset from ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt JOB ID : bd4b29f7afaa24dc6e32933ea9bc5e46bbc3a5a4 JOB LOG : /home/alex/avocado/job-results/job-2024-03-12T20.58-bd4b29f/job.log (1/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_pc: PASS (4.49 s) (2/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_q35: PASS (4.50 s) (3/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt: FAIL: Invalid PC (read ffff2d941e4d7f28 instead of ffff2d941e4d7f2c) (3.06 s) (4/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_pseries: PASS (16.48 s) (5/5) ./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_powernv: FAIL: Invalid PC (read 30020210 instead of 30020214) (2.80 s) RESULTS : PASS 3 | ERROR 0 | FAIL 2 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0 JOB TIME : 32.18 s > > I found 440fx must have some bug becaues it's quite bad, but > q35 was more stable. > >> >> > The problem with short-cutting the iowait that was taken during >> > record phase is that related events will not get consumed at the >> > same points (e.g., reading the clock). >> > >> > A hang with zero icount always seems to be a symptom of an earlier >> > problem that has caused the recording to become out of synch with >> > the execution and consumption of events by replay. >> >> Would it be possible to still detect the failure mode rather than a full >> revert? > > I'm not actually sure exactly how this patch causes test failures > (or how it was fixing the ones you saw). If I can reproduce the > ones you see I could look a bit deeper at it. > > I have been looking at some ways to try detect/report record/replay > errors more sanely because it can be very hard to debug them at > the moment. No patches quite ready to post yet, but yes if there's > something we could do here to help we should. > > Thanks, > Nick > >> >> > >> > Signed-off-by: Nicholas Piggin <npig...@gmail.com> >> > --- >> > include/sysemu/replay.h | 5 ----- >> > accel/tcg/tcg-accel-ops-rr.c | 2 +- >> > replay/replay.c | 21 --------------------- >> > 3 files changed, 1 insertion(+), 27 deletions(-) >> > >> > diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h >> > index f229b2109c..8102fa54f0 100644 >> > --- a/include/sysemu/replay.h >> > +++ b/include/sysemu/replay.h >> > @@ -73,11 +73,6 @@ int replay_get_instructions(void); >> > /*! Updates instructions counter in replay mode. */ >> > void replay_account_executed_instructions(void); >> > >> > -/** >> > - * replay_can_wait: check if we should pause for wait-io >> > - */ >> > -bool replay_can_wait(void); >> > - >> > /* Processing clocks and other time sources */ >> > >> > /*! Save the specified clock */ >> > diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c >> > index 894e73e52c..a942442a33 100644 >> > --- a/accel/tcg/tcg-accel-ops-rr.c >> > +++ b/accel/tcg/tcg-accel-ops-rr.c >> > @@ -109,7 +109,7 @@ static void rr_wait_io_event(void) >> > { >> > CPUState *cpu; >> > >> > - while (all_cpu_threads_idle() && replay_can_wait()) { >> > + while (all_cpu_threads_idle()) { >> > rr_stop_kick_timer(); >> > qemu_cond_wait_bql(first_cpu->halt_cond); >> > } >> > diff --git a/replay/replay.c b/replay/replay.c >> > index b8564a4813..895fa6b67a 100644 >> > --- a/replay/replay.c >> > +++ b/replay/replay.c >> > @@ -451,27 +451,6 @@ void replay_start(void) >> > replay_enable_events(); >> > } >> > >> > -/* >> > - * For none/record the answer is yes. >> > - */ >> > -bool replay_can_wait(void) >> > -{ >> > - if (replay_mode == REPLAY_MODE_PLAY) { >> > - /* >> > - * For playback we shouldn't ever be at a point we wait. If >> > - * the instruction count has reached zero and we have an >> > - * unconsumed event we should go around again and consume it. >> > - */ >> > - if (replay_state.instruction_count == 0 && >> > replay_state.has_unread_data) { >> > - return false; >> > - } else { >> > - replay_sync_error("Playback shouldn't have to iowait"); >> > - } >> > - } >> > - return true; >> > -} >> > - >> > - >> > void replay_finish(void) >> > { >> > if (replay_mode == REPLAY_MODE_NONE) { -- Alex Bennée Virtualisation Tech Lead @ Linaro