On 09/05/2017 10:59 AM, Richard Purdie wrote:
On Tue, 2017-09-05 at 10:24 -0400, Bruce Ashfield wrote:
On 09/05/2017 10:13 AM, Richard Purdie wrote:

Hi Bruce,

We had a locked up qemuppc lsb image and I was able to find backtraces
from the serial console log
(/home/pokybuild/yocto-autobuilder/yocto-worker/nightly-ppc-lsb/build/build/tmp/work/qemuppc-poky-linux/core-image-lsb/1.0-r0/target_logs/dmesg_output.log
in case anyone ever needs to find that). The log is below, this one is for
the 4.9 kernel.

Failure as seen on the AB:
https://autobuilder.yoctoproject.org/main/builders/nightly-ppc-lsb/builds/1189/steps/Running%20Sanity%20Tests/logs/stdio

Not sure what it means, perhaps you can make more sense of it? :)

Very interesting.

I'm (un)fortunately familiar with RCU issues, and obviously, this is
only happening under load. There's clearly a driver issue as it
interacts with whatever is running in userspace.

From the log, it looks like this is running over NFS and pinning the
CPU, and the qemu ethernet isn't handling it gracefully.

Looking at the logs I've seen I don't think this is over NFS, it should
be over virtio:

"Kernel command line: root=/dev/vda"

But exactly what it is, I can't say from that trace. I'll try and do
a cpu-pinned test on qemuppc (over NFS) and see if I can trigger the
same trace.

I'm also not sure what this might be. I did a bit more staring at the
log and I think the system did come back:

NOTE: core-image-lsb-1.0-r0 do_testimage:   test_dnf_install_from_disk (dnf.DnfRepoTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... OK (249.929s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_dnf_install_from_http (dnf.DnfRepoTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... OK (212.547s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_dnf_reinstall (dnf.DnfRepoTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... FAIL (1501.682s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_dnf_repoinfo (dnf.DnfRepoTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... FAIL (15.952s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_syslog_running (oe_syslog.SyslogTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... FAIL (3.039s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_syslog_logger (oe_syslog.SyslogTestConfig)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... SKIP (0.001s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_syslog_restart (oe_syslog.SyslogTestConfig)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... SKIP (0.001s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_syslog_startup_config (oe_syslog.SyslogTestConfig)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... SKIP (0.001s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_pam (pam.PamBasicTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... FAIL (3.003s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_parselogs (parselogs.ParseLogsTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... OK (39.675s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_rpm_help (rpm.RpmBasicTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... OK (2.590s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_rpm_query (rpm.RpmBasicTest)
NOTE: core-image-lsb-1.0-r0 do_testimage:  ... OK (2.295s)
NOTE: core-image-lsb-1.0-r0 do_testimage:   test_rpm_instal

So for a while there the system "locked up":

AssertionError: 255 != 0 : dnf 
--repofrompath=oe-testimage-repo-noarch,http://192.168.7.1:38838/noarch 
--repofrompath=oe-testimage-repo-qemuppc,http://192.168.7.1:38838/qemuppc 
--repofrompath=oe-testimage-repo-ppc7400,http://192.168.7.1:38838/ppc7400 
--nogpgcheck reinstall -y run-postinsts-dev

Process killed - no output for 1500 seconds. Total running time: 1501 seconds.

AssertionError: 255 != 0 : dnf 
--repofrompath=oe-testimage-repo-noarch,http://192.168.7.1:38838/noarch 
--repofrompath=oe-testimage-repo-qemuppc,http://192.168.7.1:38838/qemuppc 
--repofrompath=oe-testimage-repo-ppc7400,http://192.168.7.1:38838/ppc7400 
--nogpgcheck repoinfo
ssh: connect to host 192.168.7.2 port 22: No route to host

self.assertEqual(status, 1, msg = msg)
AssertionError: 255 != 1 : login command does not work as expected. Status and 
output:255 and ssh: connect to host 192.168.7.2 port 22: No route to host

then the system seems to have come back. All very odd...
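
For anyone wanting to pull the same timeline out of a do_testimage log, here is a rough Python sketch. It assumes the two line shapes quoted above (a test-name line followed by a "... STATUS (seconds)" line); the ERROR status and the function name summarize() are my own additions for illustration, not something taken from the testimage code.

```python
import re

# Line shapes assumed from the log excerpt quoted in this thread:
#   "NOTE: <recipe> do_testimage:   test_name (module.Class)"
#   "NOTE: <recipe> do_testimage:  ... STATUS (12.345s)"
PREFIX = re.compile(r"^NOTE: \S+ do_testimage:\s+(.*)$")
# ERROR is an assumed extra status; only OK, FAIL and SKIP appear in the excerpt.
RESULT = re.compile(r"^\.\.\. (OK|FAIL|SKIP|ERROR) \((\d+(?:\.\d+)?)s\)$")


def summarize(log_text):
    """Return (test_name, status, seconds) tuples in the order they were logged."""
    results = []
    current = None
    for line in log_text.splitlines():
        m = PREFIX.match(line.strip())
        if not m:
            # Wrapped "(module.Class)" continuation lines and other output.
            continue
        body = m.group(1).strip()
        if not body:
            continue
        r = RESULT.match(body)
        if r:
            if current is not None:
                results.append((current, r.group(1), float(r.group(2))))
                current = None
        else:
            # The first token is the test method name.
            current = body.split()[0]
    # A test name with no result line (e.g. a truncated log) is dropped.
    return results
```

Filtering the result for FAIL entries on the excerpt above gives test_dnf_reinstall at 1501.682s first, which lines up with the "no output for 1500 seconds" kill, followed by the short failures while the target was unreachable.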

I'd expect it to come back after the stall. But it
is good news that it isn't over NFS, since that would make things
harder to reproduce.

There's some sort of CPU-intensive task -> virtio interaction that is not
allowing ksoftirqd to run within limits.

We could back off the warning and increase the limit, but that
can cause more serious problems down the road.
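
To check what the limit currently is on a target, something like the sketch below might do. It assumes the limit in question is the RCU CPU stall timeout and that the kernel exposes it as the rcupdate module parameter in sysfs at the path shown; the helper name read_stall_timeout() is made up for illustration.

```python
from pathlib import Path

# Assumed location: the rcupdate.rcu_cpu_stall_timeout parameter via sysfs.
DEFAULT_PATH = "/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout"


def read_stall_timeout(path=DEFAULT_PATH):
    """Return the RCU CPU stall timeout in seconds, or None if it can't be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file, no permission, or unexpected contents.
        return None
```

The path is a parameter so the same helper can be pointed at a copy of the file pulled off the target.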

Bruce


Cheers,

Richard


--
_______________________________________________
Openembedded-core mailing list
Openembedded-core@lists.openembedded.org
http://lists.openembedded.org/mailman/listinfo/openembedded-core
