Hello,
I tried
mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr

and

mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
This does not change anything.
I made an attempt to narrow down what happens. Sorry, this got a bit long; a stack trace is also included below.
Looking at the actual numbers (see at the very bottom), I notice that the CHK and QTQ columns (9th and 10th column, maximum over all eigentests) are similar between the two OpenMPI versions. What changes is the "IL, IU, VL or VU altered by PDSYEVR" line, which appears only in the 2.0.1 output and not with 1.10.4. Looking at pdseprsubtst.f, comment line 751, I see that this is (as far as I understand it) a sanity check.
I inserted my own print statements in pdseprsubtst.f (and changed the optimization to "-O0 -g"), i.e.

      IF( IL.NE.OLDIL .OR. IU.NE.OLDIU .OR. VL.NE.OLDVL .OR. VU.NE.
     $    OLDVU ) THEN
         IF( IAM.EQ.0 ) THEN
            WRITE( NOUT, FMT = 9982 )
            WRITE( NOUT, '(F8.3,F8.3,F8.3,F8.3)') VL, VU, OLDVL, OLDVU
            WRITE( NOUT, '(I10,I10,I10,I10)') IL, IU, OLDIL, OLDIU
         END IF
         RESULT = 1
      END IF
The result with 2.0.1 is
500 2 2 2 8 Y 0.08 -1.00 0.81E-03 3.3 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
NaN 0.000 NaN 0.000
-1 132733856 -1 132733856
500 4 1 4 8 Y 0.18 -1.00 0.84E-03 3.5 FAILED
500 4 4 1 8 Y 0.17 -1.00 0.78E-03 2.9 PASSED EVR
The values OLDVL and OLDVU are the saved values of VL and VU on entry to pdseprsubtst (lines 253 and 254), i.e. _before_ the actual eigensolver pdsyevr is called.
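For context, the pattern in pdseprsubtst.f is then essentially the following (paraphrased, not a verbatim copy of the file; only the VL/VU saves are explicitly at lines 253/254, the IL/IU saves are assumed to sit nearby):

*     On entry: remember the requested eigenvalue range.
      OLDVL = VL
      OLDVU = VU
      OLDIL = IL
      OLDIU = IU
*     ... PDSYEVR is called with VL, VU, IL, IU among its arguments ...
*     On return, the check quoted above fires if any of them changed.
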
Working upwards in the call tree, I additionally inserted

      IF( IAM.EQ.0 ) THEN
         WRITE( NOUT, '(F8.3,F8.3)' ) VL, VU
      END IF

right before each call to PDSEPRSUBTST in pdseprtst.f. With 2.0.1 this gives
500 2 2 2 8 Y 0.07 -1.00 0.81E-03 3.3 PASSED EVR
NaN 0.000
IL, IU, VL or VU altered by PDSYEVR
NaN 0.000 NaN 0.000
-1 128725600 -1 128725600
500 4 1 4 8 Y 0.16 -1.00 0.84E-03 3.5 FAILED
0.000 0.000
0.000 0.000
0.000 0.000
0.000 0.000
0.343 0.377
-0.697 0.104
500 4 4 1 8 Y 0.17 -1.00 0.76E-03 3.1 PASSED EVR
With 1.10.4
500 2 2 2 8 Y 0.07 -1.00 0.80E-03 4.4 PASSED EVR
0.000 0.000
0.000 0.000
0.000 0.000
0.000 0.000
0.435 0.884
-0.804 0.699
500 4 1 4 8 Y 0.08 -1.00 0.91E-03 3.3 PASSED EVR
0.000 0.000
0.000 0.000
0.000 0.000
0.000 0.000
-0.437 0.253
-0.603 0.220
500 4 4 1 8 Y 0.17 -1.00 0.83E-03 3.7 PASSED EVR
So something goes wrong early on, and it is probably not related to numerics. Setting -ffpe-trap=invalid,zero,overflow in FCFLAGS (and NOOPT) then produces the trace below. Of course this has no effect on the BLACS and C routines, although the stack trace actually ends in a C routine (which might be spurious).
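For reference, the rebuild for this run amounts to something like the following in ScaLapack's SLmake.inc (a sketch only; the base "-O1 -g" flags are those of the build described further down):

# SLmake.inc (sketch): trap FP exceptions in all Fortran objects,
# including the normally unoptimized ones compiled with NOOPT
FCFLAGS = -O1 -g -ffpe-trap=invalid,zero,overflow
NOOPT   = -O0 -ffpe-trap=invalid,zero,overflow
# C sources (BLACS interface, pdlaiect.c, ...) keep their flags
CCFLAGS = -O1 -g
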
login 14:04 ~/src/scalapack/TESTING % mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
Check if overflow is handled in ieee default manner.
If this is the last output you see, you should assume
that overflow caused a floating point exception.
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
#0 0x2b921971266f in ???
#0 0x2ade83c4966f in ???
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2 0x40457b in pdseprdriver
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#2 0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0 0x2b414549566f in ???
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2 0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0 0x2b3701f4766f in ???
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2 0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
Not sure why pdlachkieee_ appears twice!
Thank you for your help!
Best Regards
Christof
Original output without my inserted WRITE statements:
On a single node (node009) with 2.0.1
IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.23 -1.00 0.18E-02 26. FAILED
500 1 2 1 8 Y 0.09 -1.00 0.74E-03 3.2 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.16 -1.00 0.83E-03 2.3 FAILED
500 1 2 2 8 Y 0.07 -1.00 0.77E-03 2.2 PASSED EVR
500 2 2 2 8 Y 0.04 -1.00 0.81E-03 3.3 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.05 -1.00 0.84E-03 3.5 FAILED
500 4 4 1 8 Y 0.06 -1.00 0.74E-03 3.5 PASSED EVR
'End of tests'
Finished 136 tests, with the following results:
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.
On node009 and node010 with 2.0.1
IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.23 -1.00 0.18E-02 26. FAILED
500 1 2 1 8 Y 0.10 -1.00 0.74E-03 3.2 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.16 -1.00 0.83E-03 2.3 FAILED
500 1 2 2 8 Y 0.09 -1.00 0.77E-03 2.2 PASSED EVR
500 2 2 2 8 Y 0.07 -1.00 0.81E-03 3.3 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.17 -1.00 0.84E-03 3.5 FAILED
500 4 4 1 8 Y 0.15 -1.00 0.77E-03 3.6 PASSED EVR
'End of tests'
Finished 136 tests, with the following results:
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.
On node009 and node010 with 1.10.4
'TEST 10 - test one large matrix'
500 1 1 1 8 Y 0.15 -1.00 0.18E-02 26. PASSED EVR
500 1 2 1 8 Y 0.10 -1.00 0.81E-03 2.7 PASSED EVR
500 1 1 2 8 Y 0.09 -1.00 0.71E-03 3.5 PASSED EVR
500 1 2 2 8 Y 0.09 -1.00 0.82E-03 2.6 PASSED EVR
500 2 2 2 8 Y 0.06 -1.00 0.80E-03 4.4 PASSED EVR
500 4 1 4 8 Y 0.07 -1.00 0.91E-03 3.3 PASSED EVR
500 4 4 1 8 Y 0.16 -1.00 0.83E-03 3.7 PASSED EVR
'End of tests'
Finished 136 tests, with the following results:
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.
----- Message from Howard Pritchard <hpprit...@gmail.com> ---------
Date: Fri, 18 Nov 2016 11:25:06 -0700
From: Howard Pritchard <hpprit...@gmail.com>
Subject: Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
To: christof.koeh...@bccms.uni-bremen.de, Open MPI Users <users@lists.open-mpi.org>
Hi Christof,
Thanks for trying out 2.0.1. Sorry that you're hitting problems.
Could you try to run the tests using the 'ob1' PML in order to
bypass PSM2?
mpirun --mca pml ob1 (all the rest of the args)
and see if you still observe the failures?
Howard
2016-11-18 9:32 GMT-07:00 Christof Köhler <
christof.koeh...@bccms.uni-bremen.de>:
Hello everybody,
I am observing failures in the xdsyevr (and xssyevr) ScaLapack self tests
when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
failures are observed. Also, with mvapich2 2.2 no failures are observed.
The other testers appear to be working with all MPIs mentioned (have to
triple check again). I somehow overlooked the failures below at first.
The system is an Intel Omni-Path system (newest Intel driver release 10.2), i.e. we are using the PSM2 MTL, I believe.
I built the OpenMPIs with gcc 6.2 and the following identical options:
./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1" --with-psm2 --with-tm --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
The ScaLapack build is also with gcc 6.2 and OpenBLAS 0.2.19, using "-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests; only the wrapper compiler changes.
With OpenMPI 1.10.4 I see on a single node
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.
With OpenMPI 1.10.4 I see on two nodes
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.
With OpenMPI 2.0.1 I see on a single node
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.
With OpenMPI 2.0.1 I see on two nodes
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.
A typical failure looks like this in the output
IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.26 -1.00 0.19E-02 15. FAILED
500 1 2 1 8 Y 0.29 -1.00 0.79E-03 3.9 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.52 -1.00 0.82E-03 2.5 FAILED
500 1 2 2 8 Y 0.41 -1.00 0.79E-03 2.3 PASSED EVR
500 2 2 2 8 Y 0.18 -1.00 0.78E-03 3.0 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.09 -1.00 0.95E-03 4.1 FAILED
500 4 4 1 8 Y 0.11 -1.00 0.91E-03 2.8 PASSED EVR
The variable OMP_NUM_THREADS is set to 1 to stop OpenBLAS from threading. We see similar problems with the Intel 2016 compilers, but I believe gcc is a good baseline.
Any ideas? For us this is a real problem, in that we do not know whether it indicates a network (transport) issue in the Intel software stack (libpsm2, hfi1 kernel module) which might affect our production codes, or whether it is an OpenMPI issue. We have some other problems I might ask about later on this list, but nothing that yields such a nice reproducer, and those other problems might well be application related.
Best Regards
Christof
----- End of message from Howard Pritchard <hpprit...@gmail.com> -----
--
Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen
PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users