Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"

Alex Bennée Tue, 12 Mar 2024 14:04:24 -0700

"Nicholas Piggin" <npig...@gmail.com> writes:

> On Tue Mar 12, 2024 at 11:33 PM AEST, Alex Bennée wrote:
>> Nicholas Piggin <npig...@gmail.com> writes:
>>
>> > This reverts commit 1f881ea4a444ef36a8b6907b0b82be4b3af253a2.
>> >
>> > That commit causes reverse_debugging.py test failures, and does
>> > not seem to solve the root cause of the problem x86-64 still
>> > hangs in record/replay tests.
>>
>> I'm still finding the reverse debugging tests failing with this series.
>
> :(
>
> In gitlab CI or your own testing? What are you running exactly?


My own - my mistake I didn't get a clean build because of the format
bug. However I'm seeing new failures:

  env QEMU_TEST_FLAKY_TESTS=1 AVOCADO_TIMEOUT_EXPECTED=1 ./pyvenv/bin/avocado 
run ./tests/avocado/reverse_debugging.py
  Fetching asset from 
./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt
  JOB ID     : bd4b29f7afaa24dc6e32933ea9bc5e46bbc3a5a4
  JOB LOG    : 
/home/alex/avocado/job-results/job-2024-03-12T20.58-bd4b29f/job.log
   (1/5) 
./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_pc: 
PASS (4.49 s)
   (2/5) 
./tests/avocado/reverse_debugging.py:ReverseDebugging_X86_64.test_x86_64_q35: 
PASS (4.50 s)
   (3/5) 
./tests/avocado/reverse_debugging.py:ReverseDebugging_AArch64.test_aarch64_virt:
 FAIL: Invalid PC (read ffff2d941e4d7f28 instead of ffff2d941e4d7f2c) (3.06 s)
   (4/5) 
./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_pseries: 
PASS (16.48 s)
   (5/5) 
./tests/avocado/reverse_debugging.py:ReverseDebugging_ppc64.test_ppc64_powernv: 
FAIL: Invalid PC (read 30020210 instead of 30020214) (2.80 s)
  RESULTS    : PASS 3 | ERROR 0 | FAIL 2 | SKIP 0 | WARN 0 | INTERRUPT 0 | 
CANCEL 0
  JOB TIME   : 32.18 s

>
> I found 440fx must have some bug becaues it's quite bad, but
> q35 was more stable.
>
>>
>> > The problem with short-cutting the iowait that was taken during
>> > record phase is that related events will not get consumed at the
>> > same points (e.g., reading the clock).
>> >
>> > A hang with zero icount always seems to be a symptom of an earlier
>> > problem that has caused the recording to become out of synch with
>> > the execution and consumption of events by replay.
>>
>> Would it be possible to still detect the failure mode rather than a full
>> revert?
>
> I'm not actually sure exactly how this patch causes test failures
> (or how it was fixing the ones you saw). If I can reproduce the
> ones you see I could look a bit deeper at it.
>
> I have been looking at some ways to try detect/report record/replay
> errors more sanely because it can be very hard to debug them at
> the moment. No patches quite ready to post yet, but yes if there's
> something we could do here to help we should.
>
> Thanks,
> Nick
>
>>
>> >
>> > Signed-off-by: Nicholas Piggin <npig...@gmail.com>
>> > ---
>> >  include/sysemu/replay.h      |  5 -----
>> >  accel/tcg/tcg-accel-ops-rr.c |  2 +-
>> >  replay/replay.c              | 21 ---------------------
>> >  3 files changed, 1 insertion(+), 27 deletions(-)
>> >
>> > diff --git a/include/sysemu/replay.h b/include/sysemu/replay.h
>> > index f229b2109c..8102fa54f0 100644
>> > --- a/include/sysemu/replay.h
>> > +++ b/include/sysemu/replay.h
>> > @@ -73,11 +73,6 @@ int replay_get_instructions(void);
>> >  /*! Updates instructions counter in replay mode. */
>> >  void replay_account_executed_instructions(void);
>> >  
>> > -/**
>> > - * replay_can_wait: check if we should pause for wait-io
>> > - */
>> > -bool replay_can_wait(void);
>> > -
>> >  /* Processing clocks and other time sources */
>> >  
>> >  /*! Save the specified clock */
>> > diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
>> > index 894e73e52c..a942442a33 100644
>> > --- a/accel/tcg/tcg-accel-ops-rr.c
>> > +++ b/accel/tcg/tcg-accel-ops-rr.c
>> > @@ -109,7 +109,7 @@ static void rr_wait_io_event(void)
>> >  {
>> >      CPUState *cpu;
>> >  
>> > -    while (all_cpu_threads_idle() && replay_can_wait()) {
>> > +    while (all_cpu_threads_idle()) {
>> >          rr_stop_kick_timer();
>> >          qemu_cond_wait_bql(first_cpu->halt_cond);
>> >      }
>> > diff --git a/replay/replay.c b/replay/replay.c
>> > index b8564a4813..895fa6b67a 100644
>> > --- a/replay/replay.c
>> > +++ b/replay/replay.c
>> > @@ -451,27 +451,6 @@ void replay_start(void)
>> >      replay_enable_events();
>> >  }
>> >  
>> > -/*
>> > - * For none/record the answer is yes.
>> > - */
>> > -bool replay_can_wait(void)
>> > -{
>> > -    if (replay_mode == REPLAY_MODE_PLAY) {
>> > -        /*
>> > -         * For playback we shouldn't ever be at a point we wait. If
>> > -         * the instruction count has reached zero and we have an
>> > -         * unconsumed event we should go around again and consume it.
>> > -         */
>> > -        if (replay_state.instruction_count == 0 && 
>> > replay_state.has_unread_data) {
>> > -            return false;
>> > -        } else {
>> > -            replay_sync_error("Playback shouldn't have to iowait");
>> > -        }
>> > -    }
>> > -    return true;
>> > -}
>> > -
>> > -
>> >  void replay_finish(void)
>> >  {
>> >      if (replay_mode == REPLAY_MODE_NONE) {

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [PATCH v4 05/24] Revert "replay: stop us hanging in rr_wait_io_event"

Reply via email to