As I'm a glutton for punishment I thought I'd have a go at fixing the
slowly growing number of record/replay bugs. The two fixes are:

 replay: stop us hanging in rr_wait_io_event
 chardev: force write all when recording replay logs

I think we are beyond 8.2 material but it would be nice to get this
functionality stable again. We have a growing number of bugs under the
icount label on gitlab:

  https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount

Changes
-------

v2

Apart from addressing tidy ups and tags I've been investigating the
failures in replay_linux.py, the more exhaustive tests which boot
both the kernel and user-space. The "fix":

  replay: report sync error when no exception in log (!DEBUG INVESTIGATION)

triggers around the time of the hang in the logs and, despite the
rather hairy EXCP->INT transitions around cpu_exec_loop(), I think it
points to a genuine problem. I added tracing to cputlb to verify
the page tables are the same and started detecting divergence between
record and replay a lot earlier than when replay_sync_error()
catches things. I see patterns like this:

   1878 tlb_fill 0x4770c000/1 1 2   tlb_fill 0x4770c000/1 1 2
   1879 tlb_fill 0x4770d000/1 1 2   tlb_fill 0x4770d000/1 1 2
   1880 tlb_fill 0x59000/1 0 2      tlb_fill 0x59000/1 0 2
   1881                           > tlb_fill 0x476dd116/1 0 2
   1882 tlb_fill 0x4770e000/1 1 2   tlb_fill 0x4770e000/1 1 2
   1883 tlb_fill 0x476dd527/1 0 2 | tlb_fill 0x476dfb17/1 0 2
   1884                           > tlb_fill 0x476de0fd/1 0 2
   1885                           > tlb_fill 0x476dce2e/1 0 2
   1886 tlb_fill 0x4770f000/1 1 2   tlb_fill 0x4770f000/1 1 2
   1887 tlb_fill 0x476df939/1 0 2 <
   1888 tlb_fill 0x47710000/1 1 2   tlb_fill 0x47710000/1 1 2
   1889 tlb_fill 0x47711000/1 1 2   tlb_fill 0x47711000/1 1 2

These don't seem to affect the overall program flow but are concerning
because the memory access patterns should be the same. My
investigations with rr seem to indicate the difference is due to
behaviour of the victim_tlb_cache which again AFAICT should be
deterministic.
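
For anyone wanting to repeat the comparison, here is a minimal sketch
of one way to find the first point where two tlb_fill traces disagree.
It is not part of the series. It assumes the trace lines look like the
"tlb_fill 0x4770c000/1 1 2" entries shown in this mail, and the helper
names (tlb_fills, first_divergence) are made up for illustration:

```python
import re
from itertools import zip_longest

# Assumed line shape, taken from the excerpt in this mail:
#   tlb_fill 0x4770c000/1 1 2
# The meaning of the individual fields is not interpreted here.
TLB_FILL_RE = re.compile(
    r"tlb_fill\s+(0x[0-9a-fA-F]+)/(\d+)\s+(\d+)\s+(\d+)"
)


def tlb_fills(lines):
    """Yield the fields of each tlb_fill entry, skipping other lines."""
    for line in lines:
        match = TLB_FILL_RE.search(line)
        if match:
            yield match.groups()


def first_divergence(record_lines, replay_lines):
    """Return (index, record_entry, replay_entry) for the first mismatch.

    index counts tlb_fill entries only. An entry is None when that
    trace ended first. Returns None when the two traces agree.
    """
    pairs = zip_longest(tlb_fills(record_lines), tlb_fills(replay_lines))
    for index, (rec, rep) in enumerate(pairs):
        if rec != rep:
            return index, rec, rep
    return None


if __name__ == "__main__":
    record = [
        "tlb_fill 0x59000/1 0 2",
        "tlb_fill 0x4770e000/1 1 2",
    ]
    replay = [
        "tlb_fill 0x59000/1 0 2",
        "tlb_fill 0x476dd116/1 0 2",
        "tlb_fill 0x4770e000/1 1 2",
    ]
    print(first_divergence(record, replay))
```

Both arguments can be any iterable of lines, so two open trace files
work as well as the lists above. The sketch only reports the first
mismatch and does not try to realign the traces afterwards; a
side-by-side diff is still the better tool for seeing the overall
pattern.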

Anyway I can't spend any time debugging it this week so I thought I'd
post the current state in case anyone is curious enough to want to go
diving into record/replay.

The following need review:

  replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
  accel/tcg: add trace_tlb_resize trace point
  accel/tcg: define tlb_fill as a trace point
  tests/avocado: remove skips from replay_kernel (1 acks, 1 sobs, 0 tbs)
  replay: stop us hanging in rr_wait_io_event
  replay/replay-char: use report_sync_error
  tests/avocado: modernise the drive args for replay_linux
  tests/avocado: add a simple i386 replay kernel test (2 acks, 1 sobs, 0 tbs)

Alex Bennée (16):
  tests/avocado: add a simple i386 replay kernel test
  tests/avocado: fix typo in replay_linux
  tests/avocado: modernise the drive args for replay_linux
  scripts/replay-dump: update to latest format
  scripts/replay_dump: track total number of instructions
  replay: remove host_clock_last
  replay: add proper kdoc for ReplayState
  replay: make has_unread_data a bool
  replay: introduce a central report point for sync errors
  replay/replay-char: use report_sync_error
  replay: stop us hanging in rr_wait_io_event
  chardev: force write all when recording replay logs
  tests/avocado: remove skips from replay_kernel
  accel/tcg: define tlb_fill as a trace point
  accel/tcg: add trace_tlb_resize trace point
  replay: report sync error when no exception in log (!DEBUG
    INVESTIGATION)

 include/sysemu/replay.h        |   5 ++
 replay/replay-internal.h       |  50 ++++++++----
 accel/tcg/cputlb.c             |   4 +
 accel/tcg/tcg-accel-ops-rr.c   |   2 +-
 chardev/char.c                 |  12 +++
 replay/replay-char.c           |   6 +-
 replay/replay-internal.c       |   5 +-
 replay/replay-snapshot.c       |   7 +-
 replay/replay.c                | 141 ++++++++++++++++++++++++++++++++-
 accel/tcg/trace-events         |   2 +
 scripts/replay-dump.py         |  95 +++++++++++++++++++---
 tests/avocado/replay_kernel.py |  27 ++++---
 tests/avocado/replay_linux.py  |   9 ++-
 13 files changed, 314 insertions(+), 51 deletions(-)

-- 
2.39.2

