On Sat, 24 Jul 2021 at 21:48, Richard Henderson <richard.hender...@linaro.org> wrote: > > On 7/24/21 10:27 AM, Peter Maydell wrote: > > On Sat, 24 Jul 2021 at 14:49, Peter Maydell <peter.mayd...@linaro.org> > > wrote: > >> There is a slight difficulty here with testing this: icount > >> doesn't seem to work for sparc Linux guests in master at the > >> moment. For instance if you get the advent calendar image from > >> https://www.qemu-advent-calendar.org/2018/download/day11.tar.xz > >> it will boot without icount with a command line like > >> qemu-system-sparc -display none -vga none -machine SS-20 -serial stdio > >> -kernel /tmp/day11/zImage.elf > >> But if you add '-icount auto' it will get as far as > >> "bootconsole [earlyprom0] disabled" and then apparently hang. > >> I'm not sure what's going on here :-( > >> (I filed this as https://gitlab.com/qemu-project/qemu/-/issues/499) > > > > This turns out to be a recent regression, caused by commit 78ff82bb > > ("accel/tcg: Reduce CF_COUNT_MASK to match TCG_MAX_INSNS"). It's > > an intermittent rather than a 100% reproducible hang. > > Ouch. Ok, I'll have a look.
I did a bit more messing around with a repro case under rr, and I think I now see why we end up hanging, although I'm not 100% sure what best to do to fix it: * We have a TB with 512 insns (tb->icount == 512) * We want to execute 511 insns (icount_decr == 511) * the code generated by gen_tb_start() does "subtract this TB's instruction count from icount_decr.u16.low, and if this is negative jump to the exitlabel". 511 - 512 == -1, so we exit the TB with status TB_EXIT_REQUESTED and without executing any guest insns * in cpu_loop_exit_tb() we look at insns_left, which is still 511, so we don't take the "early return because exit_request" path * we calculate a new insns_left = MIN(CF_COUNT_MASK, cpu->icount_budget). icount_budget is 59267, so the new insns_left is still 511. * we set icount_decr.u16.low to insns_left (ie same value as before) * we set cpu->icount_extra to icount_budget - insns_left, which is 58756 * because icount_extra is non-zero, we don't set cflags_next_tb to force us to find an exactly 511 insn TB * so we come out to the cpu_exec() main loop, and find again the same 512 insn TB we started with. * Nothing changed from the last time we tried to execute it so we just go round and round in circles never making any progress... We don't get this failure mode if CF_COUNT_MASK is larger than TCG_MAX_INSNS, because the calculation of insns_left will produce a larger number than TCG_MAX_INSNS, unless we really are running out of icount budget (in which case icount_extra should be 0 and we will force execution of that smaller TB). So the primary bug here is that cpu_loop_exec_tb() needs updating to follow the new logic of "allow insns_left = TCG_MAX_INSNS and indicate that with 0 in the CF_COUNT_MASK field". Q: in cpu_loop_exec_tb() in this calculation: /* * If the next tb has more instructions than we have left to * execute we need to ensure we find/generate a TB with exactly * insns_left instructions in it. */ if (!cpu->icount_extra && insns_left > 0 && insns_left < tb->icount) { cpu->cflags_next_tb = (tb->cflags & ~CF_COUNT_MASK) | insns_left; } why are we testing "!cpu->icount_extra" ? The thing the comment says we're looking for ("this TB has more insns than we have left to execute") would be just "insns_left > 0 && insns_left < tb->icount". And the code generated in gen_tb_start() will exit without doing anything if insns_left < tb->icount (as described above), so we'll get into an infinite loop pretty much any time we decide not to force execution of a smaller TB. It's merely that we're much more likely to do so with CF_COUNT_MASK==511, because we are accidentally very often trying to execute 511 insns of a 512 insn TB when we could execute all 512. thanks -- PMM