Hi Tom,

On Thu, 27 Jul 2023 at 14:35, Tom Rini <tr...@konsulko.com> wrote:
>
> On Thu, Jul 27, 2023 at 01:18:12PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Sun, 16 Jul 2023 at 12:18, Tom Rini <tr...@konsulko.com> wrote:
> > >
> > > On Sat, Jul 15, 2023 at 05:40:25PM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Thu, 13 Jul 2023 at 15:57, Tom Rini <tr...@konsulko.com> wrote:
> > > > >
> > > > > On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > > > > > Hi Tom,
> > > > > >
> > > > > > On Wed, 12 Jul 2023 at 14:38, Tom Rini <tr...@konsulko.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > > > > > Hi Tom,
> > > > > > > >
> > > > > > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <tr...@konsulko.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > > > > > Hi Tom,
> > > > > > > > > >
> > > > > > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <tr...@konsulko.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > > > > > to investigate.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Tom Rini <tr...@konsulko.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > > > > > >  2 files changed, 2 insertions(+)
> > > > > > > > > >
> > > > > > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > > > > > corrected?
> > > > > > > > >
> > > > > > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > > > > > >
> > > > > > > > How do we keep this list from growing?
> > > > > > >
> > > > > > > Do we need to? The problem is in essence since we rely on free
> > > > > > > resources, sometimes some heavy lifts take longer. That's what this
> > > > > > > flag is for.
> > > > > >
> > > > > > I'm fairly sure the RTC thing could be made deterministic.
> > > > >
> > > > > We've already tried that once, and it happens a lot less often. If we
> > > > > make it even looser we risk making the test itself useless.
> > > >
> > > > For sleep, yes, but for rtc it should be deterministic now... next time
> > > > you get a failure could you send me the trace?
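
To expand on what I meant by deterministic there: the test should program
the RTC to a fixed value and read it straight back, with no sleeps and no
comparison against the host clock, so a slow CI runner cannot change the
result. Something with this shape (only a rough sketch, not the actual
test/py code; the date format and output checks would need verifying):

    # Set the (emulated) RTC to a known date/time, then read it back.
    # No sleeps and no host-time comparison, so runner load doesn't matter.
    def test_rtc_set_get(u_boot_console):
        u_boot_console.run_command('date 010212002023.30')
        out = u_boot_console.run_command('date')
        assert '2023-01-02' in out
        assert '12:00' in out
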
> > >
> > > Found one:
> > > https://dev.azure.com/u-boot/u-boot/_build/results?buildId=6592&view=logs&j=b6c47816-145c-5bfe-20a7-c6a2572e6c41&t=0929c28c-6e32-5635-9624-54eaa917d713&l=599
> >
> > I don't seem to have access to that... but is it rtc or sleep?
>
> It was the RTC one, and has since rolled off and been deleted.
>
> > > And note that we have a different set of timeout problems that may or may not
> > > be configurable, which is in the upload of the pytest results. I haven't seen
> > > if there's a knob for this one yet, within Azure (or the python project we're
> > > adding for it).
> >
> > Oh dear.
> > > >
> > > > > > The spawning thing... is there a timeout for that? What actually
> > > > > > fails?
> > > > >
> > > > > It doesn't spawn in time for the framework to get to the prompt. We
> > > > > could maybe increase the timeout value. It's always the version test
> > > > > that fails.
> > > >
> > > > Ah OK, yes increasing the timeout makes sense.
> > > >
> > > > > > > >
> > > > > > > > > > I'll note that we don't have this problem with sandbox
> > > > > > > > > > tests.
> > > > > > > > >
> > > > > > > > > OK, but that's not relevant?
> > > > > > > >
> > > > > > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > > > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > > > > > >
> > > > > > > I'm sure we could make sandbox take too long to start as well, if enough
> > > > > > > other things are going on with the system. And sandbox has its own set
> > > > > > > of super frustrating issues instead, so I don't think this is a great
> > > > > > > argument to have right here (I have to run it in docker, to get around
> > > > > > > some application version requirements and exclude event_dump, bootmgr,
> > > > > > > abootimg and gpt tests, which could otherwise run, but fail for me).
> > > > > >
> > > > > > I haven't heard about this before. Is there anything that could be
> > > > > > done?
> > > > >
> > > > > I have no idea what could be done about it since I believe all of them
> > > > > run fine in CI, including on this very host, when gitlab invokes it
> > > > > rather than when I invoke it. My point here is that sandbox tests are
> > > > > just a different kind of picky about things and need their own kind of
> > > > > "just hit retry".
> > > >
> > > > Perhaps this is Python dependencies? I'm not sure, but if you see it
> > > > again, please let me know in case we can actually fix this.
> > >
> > > Alright. So the first pass I took at running sandbox pytest with as
> > > little hand-holding as possible I hit the known issue of /boot/vmlinu*
> > > being 0400 in Ubuntu.
> > > I fixed that and then re-ran and:
> > > test/py/tests/test_cleanup_build.py F
> > >
> > > ========================================== FAILURES ===========================================
> > > _________________________________________ test_clean __________________________________________
> > > test/py/tests/test_cleanup_build.py:94: in test_clean
> > >     assert not leftovers, f"leftovers: {', '.join(map(str, leftovers))}"
> > > E   AssertionError: leftovers: fdt-out.dtb, sha1-pad/sandbox-u-boot.dtb,
> > >     sha1-pad/sandbox-kernel.dtb, sha1-basic/sandbox-u-boot.dtb,
> > >     sha1-basic/sandbox-kernel.dtb, sha384-basic/sandbox-u-boot.dtb,
> > >     sha384-basic/sandbox-kernel.dtb, algo-arg/sandbox-u-boot.dtb,
> > >     algo-arg/sandbox-kernel.dtb, sha1-pss/sandbox-u-boot.dtb,
> > >     sha1-pss/sandbox-kernel.dtb, sha256-pad/sandbox-u-boot.dtb,
> > >     sha256-pad/sandbox-kernel.dtb, sha256-global-sign/sandbox-binman.dtb,
> > >     sha256-global-sign/sandbox-u-boot.dtb,
> > >     sha256-global-sign/sandbox-u-boot-global.dtb,
> > >     sha256-global-sign/sandbox-kernel.dtb,
> > >     sha256-global-sign-pss/sandbox-binman-pss.dtb,
> > >     sha256-global-sign-pss/sandbox-u-boot.dtb,
> > >     sha256-global-sign-pss/sandbox-kernel.dtb,
> > >     sha256-global-sign-pss/sandbox-u-boot-global-pss.dtb, auto_fit/dt-1.dtb,
> > >     auto_fit/dt-2.dtb, sha256-pss/sandbox-u-boot.dtb,
> > >     sha256-pss/sandbox-kernel.dtb, sha256-pss-pad/sandbox-u-boot.dtb,
> > >     sha256-pss-pad/sandbox-kernel.dtb, hashes/sandbox-kernel.dtb,
> > >     sha256-basic/sandbox-u-boot.dtb, sha256-basic/sandbox-kernel.dtb,
> > >     sha1-pss-pad/sandbox-u-boot.dtb, sha1-pss-pad/sandbox-kernel.dtb,
> > >     sha384-pad/sandbox-u-boot.dtb, sha384-pad/sandbox-kernel.dtb,
> > >     sha256-pss-pad-required/sandbox-u-boot.dtb,
> > >     sha256-pss-pad-required/sandbox-kernel.dtb, ecdsa/sandbox-kernel.dtb,
> > >     sha256-pss-required/sandbox-u-boot.dtb,
> > >     sha256-pss-required/sandbox-kernel.dtb
> > > E   assert not [PosixPath('fdt-out.dtb'),
> > >     PosixPath('sha1-pad/sandbox-u-boot.dtb'),
> > >     PosixPath('sha1-pad/sandbox-kernel.dtb'),
> > >     PosixPa...ic/sandbox-u-boot.dtb'),
> > >     PosixPath('sha1-basic/sandbox-kernel.dtb'),
> > >     PosixPath('sha384-basic/sandbox-u-boot.dtb'), ...]
> > > ------------------------------------ Captured stdout call -------------------------------------
> > > +make O=/tmp/pytest-of-trini/pytest-231/test_clean0 clean
> > > make[1]: Entering directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> > >   CLEAN   cmd
> > >   CLEAN   dts/../arch/sandbox/dts
> > >   CLEAN   dts
> > >   CLEAN   lib
> > >   CLEAN   tools
> > >   CLEAN   tools/generated
> > >   CLEAN   include/bmp_logo.h include/bmp_logo_data.h
> > >     include/generated/env.in include/generated/env.txt
> > >     drivers/video/u_boot_logo.S u-boot u-boot-dtb.bin u-boot-initial-env
> > >     u-boot-nodtb.bin u-boot.bin u-boot.cfg u-boot.dtb u-boot.dtb.gz
> > >     u-boot.dtb.out u-boot.dts u-boot.lds u-boot.map u-boot.srec u-boot.sym
> > >     System.map image.map keep-syms-lto.c lib/efi_loader/helloworld_efi.S
> > > make[1]: Leaving directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> > > =================================== short test summary info ===================================
> > > FAILED test/py/tests/test_cleanup_build.py::test_clean - AssertionError: leftovers: fdt-out....
> > > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > ================================= 1 failed, 6 passed in 6.42s =================================
> >
> > That test never passes for me locally, because as you say we add a lot
> > of files to the build directory and there is no tracking of them such
> > that 'make clean' could remove them. We could fix that, e.g.:
> >
> > 1. Have binman record all its output filenames in a binman.clean file
> > 2. Have tests always use a 'testfiles' subdir for files they create
>
> It sounds like this is showing some bugs in how we use binman since
> "make clean" should result in a clean tree, and I believe we get a few
> patches now and again about removing leftover files.
>
> > > Fixing that manually with an rm -rf of /tmp/pytest-of-trini and now it's
> > > stuck. I've rm -rf'd that and git clean -dfx and just repeat that
> > > failure. I'm hopeful that when I reboot whatever magic is broken will
> > > be cleaned out. Moving things in to a docker container again, I get:
> > > =========================================== ERRORS ============================================
> > > _______________________________ ERROR at setup of test_gpt_read _______________________________
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:74: in state_disk_image
> > >     ???
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:37: in __init__
> > >     ???
> > > test/py/u_boot_utils.py:279: in __enter__
> > >     self.module_filename = module.__file__
> > > E   AttributeError: 'NoneType' object has no attribute '__file__'
> > > =================================== short test summary info ===================================
> > > ERROR test/py/tests/test_gpt.py::test_gpt_read - AttributeError: 'NoneType' object has no at...
> > > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > ========================== 41 passed, 45 skipped, 1 error in 19.29s ===========================
> > >
> > > And then ignoring that one with "-k not gpt":
> > > test/py/tests/test_android/test_ab.py E
> > >
> > > =========================================== ERRORS ============================================
> > > __________________________________ ERROR at setup of test_ab __________________________________
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:54: in ab_disk_image
> > >     ???
> > > /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:28: in __init__
> > >     ???
> > > test/py/u_boot_utils.py:279: in __enter__
> > >     self.module_filename = module.__file__
> > > E   AttributeError: 'NoneType' object has no attribute '__file__'
> > > =================================== short test summary info ===================================
> > > ERROR test/py/tests/test_android/test_ab.py::test_ab - AttributeError: 'NoneType' object has...
> > > !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > ============= 908 passed, 75 skipped, 10 deselected, 1 error in 159.17s (0:02:39) =============
> >
> > These two are the same error. It looks like somehow it is unable to
> > obtain the module with:
> >
> > frame = inspect.stack()[1]
> > module = inspect.getmodule(frame[0])
> >
> > i.e. module is None
> >
> > +Stephen Warren who may know
> >
> > What Python version is this?
>
> It's the docker container we use for CI, where these tests pass every
> time they're run normally, whatever is in Ubuntu "Jammy".
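
OK, thanks. If inspect.getmodule() is coming back None there, one option
might be for u_boot_utils to fall back to the filename that inspect.stack()
already records for the frame, instead of relying on the module object
being found. Untested sketch of the idea only, not something I have run
against the real code:

    import inspect

    # The FrameInfo from inspect.stack() carries the source filename
    # directly, so we do not need getmodule() to succeed in mapping the
    # frame back to a module object.
    frame_info = inspect.stack()[1]
    module = inspect.getmodule(frame_info.frame)
    if module is not None and getattr(module, '__file__', None):
        module_filename = module.__file__
    else:
        # Fall back when the module object is unavailable (module is None).
        module_filename = frame_info.filename
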
>
> > > Now, funny things. If I git clean -dfx, I can then get that test to
> > > pass. So I guess something else isn't cleaning up / is writing to a
> > > common area? I intentionally build within the source tree, but in a
> > > subdirectory of that, and indeed a lot of tests write to the source
> > > directory itself.
> >
> > Wow that really is strange. The logic in that class is pretty clever.
> > Do you see a message like 'Waiting for generated file timestamp to
> > increase' at any point?
> >
> > BTW these problems don't have anything to do with sandbox, which I
> > think was your original complaint. The more stuff we bring into tests
> > (Python included) the harder it gets.
>
> The original complaint, as I saw it, was that "sandbox pytests don't
> randomly fail". My point is that sandbox pytests randomly fail all the
> time. QEMU isn't any worse. I can't say it's better since my local
> loop is sandbox for sanity then on to hardware.

The way I see it, in terms of flakiness, speed and ease of debugging,
from best to worst we have:

- sandbox
- Python wrapper around sandbox
- Python wrapper around QEMU

Regards,
Simon