On 9/4/20 11:12 AM, Satish Balay wrote:
The test harness prints:

# To rerun failed tests:
#     /usr/bin/gmake -f gmakefile test test-fail=1

So perhaps the CI can be changed to ignore the result of 'make alltests' and 
always run this [and then check the error code].

But this means that even with legitimate failures we would rerun this, and only then worry about whether it is a real error code.

However, I'm not seeing an error return code here:

Satish
------

[balay@pj01 petsc.x]$ make test globsearch='*ksp*tests*ex49_*cg*'
Using MAKEFLAGS: -- globsearch=*ksp*tests*ex49_*cg*
         TEST arch-complex/tests/counts/ksp_ksp_tests-ex49_cg.counts
  ok ksp_ksp_tests-ex49_cg
not ok diff-ksp_ksp_tests-ex49_cg # Error code: 1
#       2d1
#       < extra text
         TEST arch-complex/tests/counts/ksp_ksp_tests-ex49_pipecg2.counts

This isn't a good example since it's a diff error.  It's not what Barry is referring to.

Scott

  ok ksp_ksp_tests-ex49_pipecg2+ksp_norm_type-preconditioned
  ok diff-ksp_ksp_tests-ex49_pipecg2+ksp_norm_type-preconditioned
  ok ksp_ksp_tests-ex49_pipecg2+ksp_norm_type-unpreconditioned
  ok diff-ksp_ksp_tests-ex49_pipecg2+ksp_norm_type-unpreconditioned
  ok ksp_ksp_tests-ex49_pipecg2+ksp_norm_type-natural
  ok diff-ksp_ksp_tests-ex49_pipecg2+ksp_norm_type-natural

# -------------
#   Summary
# -------------
# FAILED diff-ksp_ksp_tests-ex49_cg
# success 7/8 tests (87.5%)
# failed 1/8 tests (12.5%)
# todo 0/8 tests (0.0%)
# skip 0/8 tests (0.0%)
#
# Wall clock time for tests: 1 sec
# Approximate CPU time (not incl. build time): 0.19 sec
#
# To rerun failed tests:
#     /usr/bin/gmake -f gmakefile test test-fail=1
#
# Timing summary (actual test time / total CPU time):
#   ksp_ksp_tests-ex49_pipecg2: 0.02 sec / 0.19 sec
#   ksp_ksp_tests-ex49_cg: 0.00 sec / 0.00 sec
[balay@pj01 petsc.x]$ echo $?
0
[balay@pj01 petsc.x]$ /usr/bin/gmake -f gmakefile test test-fail=1
Using MAKEFLAGS: -- test-fail=1
         TEST arch-complex/tests/counts/ksp_ksp_tests-ex49_cg.counts
  ok ksp_ksp_tests-ex49_cg
not ok diff-ksp_ksp_tests-ex49_cg # Error code: 1
#       2d1
#       < extra text

# -------------
#   Summary
# -------------
# FAILED diff-ksp_ksp_tests-ex49_cg
# success 1/2 tests (50.0%)
# failed 1/2 tests (50.0%)
# todo 0/2 tests (0.0%)
# skip 0/2 tests (0.0%)
#
# Wall clock time for tests: 0 sec
# Approximate CPU time (not incl. build time): 0.01 sec
#
# To rerun failed tests:
#     /usr/bin/gmake -f gmakefile test test-fail=1
#
# Timing summary (actual test time / total CPU time):
#   ksp_ksp_tests-ex49_cg: 0.01 sec / 0.01 sec
[balay@pj01 petsc.x]$ echo $?
0
[balay@pj01 petsc.x]$
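
Both invocations in the log above return 0 even though the summary reports a failed test, so a CI step that checks only `$?` would not notice the failure. A minimal sketch of one possible workaround is to scan the summary text for the `# FAILED` lines the harness prints. The function name `has_failures` and the log file name `rerun.log` are hypothetical, used here only for illustration.

```shell
# Hypothetical CI helper: decide pass/fail from the harness summary text,
# because the exit status in the log above is 0 even with a failed test.
has_failures() {
  # Reads harness output on stdin; returns 0 if any line starts with "# FAILED".
  grep -q '^# FAILED'
}

# Usage sketch (not executed here; rerun.log is an assumed file name):
#   /usr/bin/gmake -f gmakefile test test-fail=1 > rerun.log 2>&1
#   if has_failures < rerun.log; then exit 1; fi
```

This relies on the summary format shown in the log and would need adjusting if that format changes.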



On Fri, 4 Sep 2020, Scott Kruger wrote:


That's a good idea, but I'll have to think about this a bit.   It seems
relatively straightforward, but I'd be doing this in bash so I'd like to come
up with an implementation that is not overly complicated.    Do you have a job
that has the issue offhand?

Scott


On 9/4/20 10:27 AM, Barry Smith wrote:
    Scott,

     How difficult would it be for the test harness to run a failed test
     again if the return code has specific values? Instead of erroring out.

     I am thinking in particular about GPUs, but it is general. If the GPU
     doesn't have the resources available, the test will error out, crashing
     the entire job in the pipeline and requiring the job to be retried from
     the GUI, wasting everyone's time.

     Seems in theory like it should be pretty straightforward but, of course,
     unforeseen issues can make it difficult. Just check the program's error
     code and, if it is one of certain values, run the program again, or wait
     a few seconds and run it again.

    Barry
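
A minimal sketch of the retry idea described above, written in shell since that is where the harness change would be made. It is not the harness's actual implementation. The function name `run_with_retry`, the variables `RETRY_CODES`, `MAX_TRIES`, and `RETRY_DELAY`, and the default code 75 are all assumptions for illustration; which exit codes indicate a transient GPU resource failure is not stated in this thread.

```shell
# Hypothetical settings (not existing PETSc harness options).
RETRY_CODES="${RETRY_CODES:-75}"   # space-separated exit codes treated as transient
MAX_TRIES="${MAX_TRIES:-3}"        # total attempts, including the first
RETRY_DELAY="${RETRY_DELAY:-5}"    # seconds to wait between attempts

run_with_retry() {
  local try=1 rc code retry
  while :; do
    "$@"
    rc=$?
    # Success, or no attempts left: hand back the last exit code.
    if [ "$rc" -eq 0 ] || [ "$try" -ge "$MAX_TRIES" ]; then
      return "$rc"
    fi
    # Retry only when the exit code is in the transient list.
    retry=0
    for code in $RETRY_CODES; do
      if [ "$rc" -eq "$code" ]; then retry=1; fi
    done
    if [ "$retry" -ne 1 ]; then
      return "$rc"
    fi
    sleep "$RETRY_DELAY"
    try=$((try + 1))
  done
}
```

A call such as `run_with_retry ./ex49 -ksp_type cg` (an illustrative command line) would rerun the program only for the listed codes, so a legitimate failure with any other code is reported immediately.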


Issues are still broken, hence posting here.


--
Tech-X Corporation               [email protected]
5621 Arapahoe Ave, Suite A       Phone: (720) 974-1841
Boulder, CO 80303                Fax:   (303) 448-7756
