Hello everybody,

I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self tests when running on one or two nodes with OpenMPI 2.0.1. With OpenMPI 1.10.4 no failures are observed, and with mvapich2 2.2 no failures are observed either. The other testers appear to work with all MPIs mentioned (I still have to triple-check this); I somehow overlooked the failures below at first.

The system is an Intel OmniPath system (newest Intel driver release 10.2), i.e. I believe we are using the PSM2 MTL.
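
For completeness, the component selection can be made visible with the generic framework verbosity parameters, e.g. (a sketch; ompi_info only shows that the psm2 MTL was built, the verbose run shows what is actually selected at runtime, environment forwarding options omitted for brevity):

ompi_info | grep -i psm2
mpirun -n 4 -mca pml_base_verbose 10 -mca mtl_base_verbose 10 -host node009,node010,node009,node010 ./xdsyevr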

I built both OpenMPI versions with gcc 6.2 and the following identical options:
./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1" --with-psm2 --with-tm --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default

The ScaLAPACK build is also done with gcc 6.2 and OpenBLAS 0.2.19, using "-O1 -g" as FCFLAGS and CCFLAGS. The build is identical for all tests; only the wrapper compiler changes.
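
For reference, the ScaLAPACK build boils down to roughly the following, with SLmake.inc.example copied to SLmake.inc and the relevant variables overridden on the make command line (a sketch; the OpenBLAS path is a placeholder and mpif90/mpicc are the wrappers of the MPI under test):

make all FC=mpif90 CC=mpicc FCFLAGS="-O1 -g" CCFLAGS="-O1 -g" \
    BLASLIB="-L/path/to/openblas -lopenblas" LAPACKLIB="-L/path/to/openblas -lopenblas"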

With OpenMPI 1.10.4 I see on a single node

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
  136 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
    0 tests completed and failed.

With OpenMPI 1.10.4 I see on two nodes

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
  136 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
    0 tests completed and failed.

With OpenMPI 2.0.1 I see on a single node

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr
   32 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
  104 tests completed and failed.

With OpenMPI 2.0.1 I see on two nodes

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010 ./xdsyevr
   32 tests completed and passed residual checks.
    0 tests completed without checking.
    0 tests skipped for lack of memory.
  104 tests completed and failed.

A typical failure looks like this in the output; the "IL, IU, VL or VU altered by PDSYEVR" message means the tester found one of those input range arguments modified on return from the routine:

IL, IU, VL or VU altered by PDSYEVR
   500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.     FAILED
   500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
   500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5     FAILED
   500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3     PASSED   EVR
   500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0     PASSED   EVR
IL, IU, VL or VU altered by PDSYEVR
   500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1     FAILED
   500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8     PASSED   EVR


The variable OMP_NUM_THREADS is set to 1 to keep OpenBLAS from spawning threads.
We see similar problems with the Intel 2016 compilers, but I believe gcc is a good baseline.

Any ideas? For us this is a real problem: we do not know whether it indicates a network (transport) issue in the Intel software stack (libpsm2, hfi1 kernel module), which might affect our production codes, or an OpenMPI issue. We have some other problems I might ask about later on this list, but nothing that yields such a nice reproducer, and those other problems might well be application related.
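
If it helps to narrow this down, the transport could be pinned explicitly for the same runs, e.g. forcing the PSM2 MTL on the one hand and bypassing OmniPath via TCP on the other (a sketch using the standard PML/MTL/BTL selection parameters, same hosts as above, environment forwarding options omitted for brevity):

mpirun -n 4 -mca pml cm -mca mtl psm2 -host node009,node010,node009,node010 ./xdsyevr
mpirun -n 4 -mca pml ob1 -mca btl tcp,vader,self -host node009,node010,node009,node010 ./xdsyevr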

Best Regards

Christof

--
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

