Christof, out of curiosity, could you try running mpirun --mca coll ^tuned ... and see if it helps?
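
A minimal sketch of such a run, assuming the elided arguments are the same ones
used for the xdsyevr tests quoted below (the host list, interfaces and exported
variables are taken from those commands and may need adjusting):

    # remaining arguments assumed from the single-node xdsyevr run quoted below
    mpirun --mca coll ^tuned -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
        -mca oob_tcp_if_include eth0,team0 \
        -host node009,node009,node009,node009 ./xdsyevr

The ^ only excludes the "tuned" collective component from selection; everything
else stays at its defaults, so this should show whether the failures follow the
tuned collectives.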
Cheers,

Gilles

On Tue, Nov 22, 2016 at 7:21 PM, Christof Koehler
<christof.koeh...@bccms.uni-bremen.de> wrote:
> Hello again,
>
> I tried to replicate the situation on the workstation at my desk,
> running Ubuntu 14.04 (gcc 4.8.4) and with the OS-supplied LAPACK and
> BLAS libraries.
>
> With OpenMPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and
> failed." with the "IL, IU, VL or VU altered by PDSYEVR" message, but
> reasonable-looking numbers as described before.
>
> With 1.10 I get "136 tests completed and passed residual checks."
> instead, as observed before.
>
> So this is likely not an Omni-Path problem but something else in 2.0.1.
>
> I should perhaps clarify that I am using the current revision 206 from
> the ScaLAPACK trunk (svn co
> https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk),
> but if I remember correctly I had very similar problems with the 2.0.2
> release tarball.
>
> Both MPIs were built with
> ./configure --with-hwloc=internal --enable-static
> --enable-orterun-prefix-by-default
>
> Best Regards
>
> Christof
>
> On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
>> Hi Christof,
>>
>> Thanks for trying out 2.0.1. Sorry that you're hitting problems.
>> Could you try to run the tests using the 'ob1' PML in order to
>> bypass PSM2?
>>
>> mpirun --mca pml ob1 (all the rest of the args)
>>
>> and see if you still observe the failures?
>>
>> Howard
>>
>>
>> 2016-11-18 9:32 GMT-07:00 Christof Köhler <
>> christof.koeh...@bccms.uni-bremen.de>:
>>
>> > Hello everybody,
>> >
>> > I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self tests
>> > when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
>> > failures are observed. Also, with MVAPICH2 2.2 no failures are observed.
>> > The other testers appear to be working with all MPIs mentioned (I have to
>> > triple-check again). I somehow overlooked the failures below at first.
>> >
>> > The system is an Intel Omni-Path system (newest Intel driver release
>> > 10.2), i.e. we are using the PSM2 MTL, I believe.
>> >
>> > I built the OpenMPIs with gcc 6.2 and the following identical options:
>> > ./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
>> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
>> > --enable-orterun-prefix-by-default
>> >
>> > The ScaLAPACK build is also with gcc 6.2 and OpenBLAS 0.2.19, using
>> > "-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests; only the
>> > wrapper compiler changes.
>> >
>> > With OpenMPI 1.10.4 I see on a single node
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
>> > ./xdsyevr
>> > 136 tests completed and passed residual checks.
>> > 0 tests completed without checking.
>> > 0 tests skipped for lack of memory.
>> > 0 tests completed and failed.
>> >
>> > With OpenMPI 1.10.4 I see on two nodes
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
>> > ./xdsyevr
>> > 136 tests completed and passed residual checks.
>> > 0 tests completed without checking.
>> > 0 tests skipped for lack of memory.
>> > 0 tests completed and failed.
>> >
>> > With OpenMPI 2.0.1 I see on a single node
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
>> > ./xdsyevr
>> > 32 tests completed and passed residual checks.
>> > 0 tests completed without checking.
>> > 0 tests skipped for lack of memory.
>> > 104 tests completed and failed.
>> >
>> > With OpenMPI 2.0.1 I see on two nodes
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
>> > ./xdsyevr
>> > 32 tests completed and passed residual checks.
>> > 0 tests completed without checking.
>> > 0 tests skipped for lack of memory.
>> > 104 tests completed and failed.
>> >
>> > A typical failure looks like this in the output:
>> >
>> > IL, IU, VL or VU altered by PDSYEVR
>> >   500  1  1  1  8  Y  0.26  -1.00  0.19E-02  15.  FAILED
>> >   500  1  2  1  8  Y  0.29  -1.00  0.79E-03  3.9  PASSED   EVR
>> > IL, IU, VL or VU altered by PDSYEVR
>> >   500  1  1  2  8  Y  0.52  -1.00  0.82E-03  2.5  FAILED
>> >   500  1  2  2  8  Y  0.41  -1.00  0.79E-03  2.3  PASSED   EVR
>> >   500  2  2  2  8  Y  0.18  -1.00  0.78E-03  3.0  PASSED   EVR
>> > IL, IU, VL or VU altered by PDSYEVR
>> >   500  4  1  4  8  Y  0.09  -1.00  0.95E-03  4.1  FAILED
>> >   500  4  4  1  8  Y  0.11  -1.00  0.91E-03  2.8  PASSED   EVR
>> >
>> > The variable OMP_NUM_THREADS is set to 1 to stop OpenBLAS from threading.
>> > We see similar problems with the Intel 2016 compilers, but I believe gcc
>> > is a good baseline.
>> >
>> > Any ideas? For us this is a real problem in that we do not know if this
>> > indicates a network (transport) issue in the Intel software stack
>> > (libpsm2, hfi1 kernel module) which might affect our production codes,
>> > or if this is an OpenMPI issue. We have some other problems I might ask
>> > about later on this list, but nothing which yields such a nice
>> > reproducer, and especially these other problems might well be
>> > application related.
>> >
>> > Best Regards
>> >
>> > Christof
>> >
>> > --
>> > Dr. rer. nat. Christof Köhler     email: c.koeh...@bccms.uni-bremen.de
>> > Universitaet Bremen/ BCCMS        phone: +49-(0)421-218-62334
>> > Am Fallturm 1/ TAB/ Raum 3.12     fax:   +49-(0)421-218-62770
>> > 28359 Bremen
>> >
>> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>
> --
> Dr. rer. nat. Christof Köhler     email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS        phone: +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12     fax:   +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
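
For reference, a sketch of the ob1 run Howard suggests in the quoted thread,
again with the remaining arguments assumed from the two-node xdsyevr invocation
above (adjust hosts, interfaces and environment as needed):

    # pml ob1 bypasses the PSM2 MTL; remaining args assumed from the runs above
    export OMP_NUM_THREADS=1
    mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
        -mca oob_tcp_if_include eth0,team0 \
        -host node009,node010,node009,node010 ./xdsyevr

If the failures disappear with ob1 but remain with coll ^tuned, that would
point toward the PSM2/Omni-Path transport rather than the collective component;
if it is the other way around, the tuned collectives would be the more likely
suspect.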