Nathan Chancellor's on April 11, 2020 10:53 am:
> Hi Nicholas,
>
> On Sat, Apr 11, 2020 at 10:29:45AM +1000, Nicholas Piggin wrote:
>> Nathan Chancellor's on April 11, 2020 6:59 am:
>> > Hi all,
>> >
>> > Recently, our CI started running into several hangs when running the
>> > spinlock torture tests during a boot with QEMU 3.1.0 on
>> > powernv_defconfig and pseries_defconfig when compiled with Clang.
>> >
>> > I initially bisected Linux and came down to commit 3282a3da25bd
>> > ("powerpc/64: Implement soft interrupt replay in C") [1], which seems
>> > to make sense. However, I realized I could not reproduce this in my
>> > local environment no matter how hard I tried, only in our Docker
>> > image. I then realized my environment's QEMU version was 4.2.0; I
>> > compiled 3.1.0 and was able to reproduce it then.
>> >
>> > I bisected QEMU down to two commits: powernv_defconfig was fixed by [2]
>> > and pseries_defconfig was fixed by [3].
>>
>> Looks like it might have previously been testing power8, now power9?
>> -cpu power8 might get it reproducing again.
>
> Yes, that is what it looks like. I can reproduce the hang with both
> pseries-3.1 and powernv8 on QEMU 4.2.0.
>
>> > I ran 100 boots with our boot-qemu.sh script [4] and QEMU 3.1.0 failed
>> > approximately 80% of the time but 4.2.0 and 5.0.0-rc1 only failed 1%
>> > of the time [5]. GCC 9.3.0 built kernels failed approximately 3% of
>> > the time [6].
>>
>> Do they fail in the same way? Was the fail rate at 0% before upgrading
>> kernels?
>
> Yes, it just hangs after I see the print out that the torture tests are
> running.
>
> [ 2.277125] spin_lock-torture: Creating torture_shuffle task
> [ 2.279058] spin_lock-torture: Creating torture_stutter task
> [ 2.280285] spin_lock-torture: torture_shuffle task started
> [ 2.281326] spin_lock-torture: Creating lock_torture_writer task
> [ 2.282509] spin_lock-torture: torture_stutter task started
> [ 2.283511] spin_lock-torture: Creating lock_torture_writer task
> [ 2.285155] spin_lock-torture: lock_torture_writer task started
> [ 2.286586] spin_lock-torture: Creating lock_torture_stats task
> [ 2.287772] spin_lock-torture: lock_torture_writer task started
> [ 2.290578] spin_lock-torture: lock_torture_stats task started
>
> Yes, we never had any failures in our CI before that upgrade happened. I
> will try to run a set of boot tests with a kernel built at the commit
> right before 3282a3da25bd and at 3282a3da25bd to make triple sure I did
> fall on the right commit.
>
>> > Without access to real hardware, I cannot really say if there is a
>> > problem here. We are going to upgrade to QEMU 4.2.0 to fix it. This
>> > is more of an FYI so that there is some record of it outside of our
>> > issue tracker and so people can be aware of it in case it comes up
>> > somewhere else.
>>
>> Thanks for this, I'll try to reproduce. You're not running an SMP guest?
>
> No, not as far as I am aware at least. You can see our QEMU line in our
> CI and the boot-qemu.sh script I have listed below:
>
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/318260635
>
>> Anything particular to run the lock torture test? This is just
>> powernv_defconfig + CONFIG_LOCK_TORTURE_TEST=y ?
>
> We do enable some other configs; you can see those here:
>
> https://github.com/ClangBuiltLinux/continuous-integration/blob/c02d2f008a64d44e62518bc03beb1126db7619ce/configs/common.config
> https://github.com/ClangBuiltLinux/continuous-integration/blob/c02d2f008a64d44e62518bc03beb1126db7619ce/configs/tt.config
>
> The tt.config values are needed to reproduce, but I did not verify that
> ONLY tt.config was needed. Other than that, no, we are just building
> either pseries_defconfig or powernv_defconfig with those configs and
> letting it boot up with a simple initramfs, which prints the version
> string then shuts the machine down.
>
> Let me know if you need any more information, cheers!
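[For reference, the fragment-on-top-of-defconfig build described above could be sketched roughly as follows, using the kernel's in-tree merge_config.sh helper. The paths, the use of CC=clang, and the build targets are assumptions based on the report, not a verified copy of the CI scripts; this is a configuration sketch, not a tested recipe.]

```shell
# Assumed: a powerpc cross toolchain, clang, and the two config
# fragments downloaded from the ClangBuiltLinux repo into the tree.
cd linux
make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- CC=clang \
    powernv_defconfig
# merge_config.sh applies fragment files on top of the current .config;
# -m merges only, without invoking "make olddefconfig" itself.
./scripts/kconfig/merge_config.sh -m .config common.config tt.config
make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- CC=clang \
    olddefconfig
make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- CC=clang \
    -j"$(nproc)" vmlinux
```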
Okay, I can reproduce it. Sometimes it eventually recovers after a long
pause, and some keyboard input often helps it along, so that seems like it
might be a lost interrupt.

POWER8 vs POWER9 might just be a timing thing if P9 is still hanging
sometimes. I wasn't able to reproduce it with defconfig+tt.config; I
needed your other config with the various other debug options.

Thanks for the very good report. I'll let you know what I find.

Thanks,
Nick
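[The 100-boot failure-rate measurement Nathan describes could be sketched with a loop like the one below. `QEMU_CMD` is a hypothetical placeholder for the real qemu-system-ppc64 invocation from boot-qemu.sh, and the 120-second bound is an assumed per-boot budget; it defaults to `true` here only so the loop itself is runnable as written.]

```shell
#!/bin/sh
# Hedged sketch: boot the guest N times and count hangs by bounding each
# boot with timeout(1). A hung boot never reaches the guest's shutdown,
# so it hits the timeout, exits non-zero, and counts as a failure.
QEMU_CMD="${QEMU_CMD:-true}"   # placeholder; real run: qemu-system-ppc64 ...
N="${N:-100}"
fails=0
i=0
while [ "$i" -lt "$N" ]; do
    timeout 120 sh -c "$QEMU_CMD" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
done
echo "$fails/$N boots failed"
```

With the `true` stand-in every boot "succeeds", so the loop reports 0/100; substituting the real QEMU line reproduces the percentages quoted in the report.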