Pavel Dovgalyuk <dovga...@ispras.ru> writes: >> From: mttcg-requ...@listserver.greensocs.com >> [mailto:mttcg-requ...@listserver.greensocs.com] >> Pavel Dovgalyuk <dovga...@ispras.ru> writes: >> >> From: mttcg-requ...@listserver.greensocs.com [mailto:mttcg- >> requ...@listserver.greensocs.com] >> >> Pavel Dovgalyuk <dovga...@ispras.ru> writes: >> >> >> From: mttcg-requ...@listserver.greensocs.com [mailto:mttcg- >> >> requ...@listserver.greensocs.com] >> >> >> Pavel Dovgalyuk <dovga...@ispras.ru> writes: >> <snip> >> >> >> > I tested on vexpress-a9 platform with Debian wheezy. >> >> >> >> >> >> Thanks for that. I now have a test case that I can reproduce failures >> >> >> on >> >> >> without needing graphics. >> >> >> >> >> >> I've been investigating if there are any problems with the timer >> >> >> processing now they have been moved into the TCG thread. The record >> >> >> stage seems to work fine but I'm having difficulty figuring out why >> >> >> playback freezes. It seems we get to a point where we are stuck waiting >> >> >> for a suspiciously exact timer deadline: >> >> > >> >> > I've encountered the following behavior at replay stage: >> >> > - replay takes some instructions to execute (qemu_icount += counter) >> >> > - virtual timer is fired >> >> >> >> This is the virtual timer based on icount not the virtual rt timer? So >> > >> > Regular virtual timer. It's value is based on icount. >> > virtual_rt is used for internal icount purposes. >> >> And this is where the clock warps come in? The bias brings the >> virtual_rt time forward because execution is waiting for some external >> event to fire (e.g. a timer IRQ)? > > I guess so. But bias is not updated when the vCPU works. > vCPU thread updates only qemu_icount which is used for virtual clock > calculation. > >> >> under the new scheme of being processed in the vCPU loop we should only >> >> fire this one once all execution is done (although you may exit the loop >> >> early before icount is exhausted). >> > >> > We should stop the vCPU before processing virtual timers because: >> > - virtual clock depends on instruction counter >> > - virtual timers may change virtual hardware state >> >> If we do the processing in the top of main vCPU loop this should be >> equivalent. The instruction counter cannot run because we haven't >> entered tcg_exec_cpu. We also process virtual timers in this thread >> outside the loop so nothing else can be poking the hardware state. > > This is how qemu worked in older version - it switched between > processing tasks (vCPU and timers) in one thread.
The only real difference is the sequencing in the old case was protected by the BQL - now its my program order. > But how we'll join this behavior with the current design and MTTCG? Ignore MTTCG for now. We don't even try and run multiple-threads when icount is enabled. However the change in the BQL locking is what has triggered the failures. Anyway I think I'm getting closer to narrowing it down. On record I see replay_current_step jump backwards with this: /* A common event print, called when reading or saving an event */ static void print_event(uint8_t event) { static int event_count; static uint64_t last_step; uint64_t this_step = replay_get_current_step(); fprintf(stderr, "replay: event %d is %d @ step=%#" PRIx64 "\n", event_count, event, this_step); if (this_step < last_step) { fprintf(stderr,"%s: !!! step %d went backwards %#"PRIx64"=>%#"PRIx64"!!!\n", __func__, event_count, last_step, this_step); abort(); } last_step = this_step; event_count++; } void replay_put_event(uint8_t event) { assert(event < EVENT_COUNT); replay_put_byte(event); print_event(event); } The jump back is fairly consistent in my case (event 66 in the vexpress run) but I'm fairly sure that is the bug. I can't see any reason for the step count to go backwards - indeed that breaks the premise of . Unfortunately when I start putting break and watchpoints in to see how this jump back occurs the problem goes away until I disable them. So this very much looks like a race condition corrupting the icount data. >> >> > - replay puts back unexecuted instructions (qemu_icount -= counter) >> >> > >> >> > But virtual timer cannot take in account non-executed instructions >> >> > (counter) and >> >> > therefore it reads only qemu_icount, which provides incorrect time. >> >> > >> >> >> But the timers are all enabled: >> >> >> >> >> >> (gdb) qemu timers >> >> >> Processing Realtime timers >> >> >> clock QEMU_CLOCK_REALTIME is enabled:true, >> >> >> last:-9223372036854775808 >> >> >> Processing Virtual timers >> >> >> clock QEMU_CLOCK_VIRTUAL is enabled:true, last:-9223372036854775808 >> >> >> timer 34297350016/1 (cb:0x555555a2e952 <ptimer_tick>) >> >> >> timer 503290000000/1000000 (cb:0x555555bd4d1d <ra_timer_handler>) >> >> >> Processing Host timers >> >> >> clock QEMU_CLOCK_HOST is enabled:true, last:1490191319270134000 >> >> >> Processing Virtual RT timers >> >> >> clock QEMU_CLOCK_VIRTUAL_RT is enabled:true, >> >> >> last:-9223372036854775808 >> >> > >> >> > Timers are processed only at checkpoints recorded in the log. >> >> > When replay is stuck, probably there is a pending checkpoint in the log >> >> > and pending instructions in CPU (because iothread breaks its >> >> > synchronization). >> >> >> >> How does this work for instructions that sleep (like WFI)? The WFI will >> >> cause us to exit the loop but I assume we need a "real" timer to trigger >> >> a firing of whatever the WFI is waiting for as time according to icount >> >> isn't advancing. >> > >> > This is virtual_rt for. It advances even when vCPU is stopped. >> >> So I'm not seeing this happen: >> >> process_icount_data: icount not increased 117302332749=>117302332749 RT:-1 >> >> I wonder if the advance of virtual_rt has been deadlocked? The -1 >> suggests its not even setup. > > qemu_start_warp_timer says that it is setup only when there are active > virtual timers. > > >> > By the way, I made small description of the replay log format >> > to be later included into docs/replay.txt: >> > >> Thanks. That's really useful. >> >> Do you think it would be worth including the dumper script I wrote in >> ./scripts/? I mainly wrote it to aid my understanding of the code flow >> but it might be useful in future investigations? > > Usually I get human-readable logs in record and replay modes to find > divergence between them, but I think your script might be useful in some > cases. Cool. I'll include it in the icount series once I have something to post (fingers crossed I'm getting close). -- Alex Bennée