Hello everybody,
I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self
tests when running on one or two nodes with Open MPI 2.0.1. With Open
MPI 1.10.4 no failures are observed, and with MVAPICH2 2.2 no failures
are observed either.
The other testers appear to pass with all of the MPIs mentioned (I
still have to triple-check this); I somehow overlooked the failures
shown below at first.
The system is an Intel Omni-Path system (newest Intel driver release
10.2), so I believe we are using the PSM2 MTL.
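As a sanity check that PSM2 support was actually compiled into both
builds (this only confirms the component is present, not that it is
selected at run time), one can run

ompi_info | grep -i psm2

which should list the psm2 MTL component.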
I built both Open MPI versions with GCC 6.2 and the following
identical options:
./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1" \
    --with-psm2 --with-tm --with-hwloc=internal --enable-static \
    --enable-orterun-prefix-by-default
ScaLAPACK is likewise built with GCC 6.2 and OpenBLAS 0.2.19, using
"-O1 -g" for both FCFLAGS and CCFLAGS. The build is identical for all
tests; only the wrapper compiler changes.
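For reference, the relevant SLmake.inc fragment looks roughly like the
sketch below; the OpenBLAS path is a placeholder, not our actual
install location:

FC        = mpif90
CC        = mpicc
FCFLAGS   = -O1 -g
CCFLAGS   = -O1 -g
BLASLIB   = -L/path/to/openblas/lib -lopenblas
LAPACKLIB = $(BLASLIB)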
With Open MPI 1.10.4 on a single node I see:
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node009,node009,node009 ./xdsyevr
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.
With Open MPI 1.10.4 on two nodes I see:
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.
With Open MPI 2.0.1 on a single node I see:
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node009,node009,node009 ./xdsyevr
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.
With Open MPI 2.0.1 on two nodes I see:
mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.
A typical failure looks like this in the output; the tester reports
that PDSYEVR altered one of its input arguments IL, IU, VL, or VU:
IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.26 -1.00 0.19E-02 15. FAILED
500 1 2 1 8 Y 0.29 -1.00 0.79E-03 3.9 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.52 -1.00 0.82E-03 2.5 FAILED
500 1 2 2 8 Y 0.41 -1.00 0.79E-03 2.3 PASSED EVR
500 2 2 2 8 Y 0.18 -1.00 0.78E-03 3.0 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.09 -1.00 0.95E-03 4.1 FAILED
500 4 4 1 8 Y 0.11 -1.00 0.91E-03 2.8 PASSED EVR
The variable OMP_NUM_THREADS is set to 1 to stop OpenBLAS from
threading.
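Note that OpenBLAS also reads its own OPENBLAS_NUM_THREADS variable,
which takes precedence over OMP_NUM_THREADS, so to be completely safe
one can export both before launching:

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1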
We see similar problems with the Intel 2016 compilers, but I believe
GCC is a good baseline.
Any ideas? For us this is a real problem: we do not know whether it
indicates a network (transport) issue in the Intel software stack
(libpsm2, hfi1 kernel module), which might affect our production
codes, or whether it is an Open MPI issue. We have some other problems
I might ask about later on this list, but nothing that yields such a
nice reproducer, and those other problems might well be application
related.
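One experiment we have not run yet, but which should help separate
the two possibilities: force Open MPI onto the ob1 PML with the TCP
and shared-memory BTLs, bypassing the PSM2 MTL entirely, e.g.

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -mca pml ob1 -mca btl tcp,vader,self \
    -host node009,node010,node009,node010 ./xdsyevr

If the tests then pass, the PSM2 path is suspect; if they still fail,
Open MPI 2.0.1 itself is the more likely culprit.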
Best Regards
Christof
--
Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen
PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/