Hello again,

I tried to replicate the situation on the workstation at my desk, running Ubuntu 14.04 (gcc 4.8.4) with the OS-supplied LAPACK and BLAS libraries.

With OpenMPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and failed." with the "IL, IU, VL or VU altered by PDSYEVR" message, but reasonable-looking numbers as described before. With 1.10 I get "136 tests completed and passed residual checks." instead, as observed before. So this is likely not an Omni-Path problem but something else in 2.0.1.

I should perhaps clarify that I am using the current revision 206 from the ScaLAPACK trunk (svn co https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk), but if I remember correctly I had very similar problems with the 2.0.2 release tarball.

Both MPIs were built with

./configure --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default

Best Regards

Christof

On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
> Hi Christof,
>
> Thanks for trying out 2.0.1. Sorry that you're hitting problems.
> Could you try to run the tests using the 'ob1' PML in order to
> bypass PSM2?
>
> mpirun --mca pml ob1 (all the rest of the args)
>
> and see if you still observe the failures?
>
> Howard
>
>
> 2016-11-18 9:32 GMT-07:00 Christof Köhler <
> christof.koeh...@bccms.uni-bremen.de>:
>
> > Hello everybody,
> >
> > I am observing failures in the xdsyevr (and xssyevr) ScaLapack self tests
> > when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
> > failures are observed. Also, with mvapich2 2.2 no failures are observed.
> > The other testers appear to be working with all MPIs mentioned (have to
> > triple check again). I somehow overlooked the failures below at first.
> >
> > The system is an Intel OmniPath system (newest Intel driver release 10.2),
> > i.e. we are using the PSM2
> > mtl I believe.
> >
> > I built the OpenMPIs with gcc 6.2 and the following identical options:
> > ./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
> > --enable-orterun-prefix-by-default
> >
> > The ScaLapack build is also with gcc 6.2, openblas 0.2.19 and using "-O1
> > -g" as FCFLAGS and CCFLAGS identical for all tests, only wrapper compiler
> > changes.
> >
> > With OpenMPI 1.10.4 I see on a single node
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> > ./xdsyevr
> > 136 tests completed and passed residual checks.
> > 0 tests completed without checking.
> > 0 tests skipped for lack of memory.
> > 0 tests completed and failed.
> >
> > With OpenMPI 1.10.4 I see on two nodes
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> > ./xdsyevr
> > 136 tests completed and passed residual checks.
> > 0 tests completed without checking.
> > 0 tests skipped for lack of memory.
> > 0 tests completed and failed.
> >
> > With OpenMPI 2.0.1 I see on a single node
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> > ./xdsyevr
> > 32 tests completed and passed residual checks.
> > 0 tests completed without checking.
> > 0 tests skipped for lack of memory.
> > 104 tests completed and failed.
> >
> > With OpenMPI 2.0.1 I see on two nodes
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> > ./xdsyevr
> > 32 tests completed and passed residual checks.
> > 0 tests completed without checking.
> > 0 tests skipped for lack of memory.
> > 104 tests completed and failed.
> >
> > A typical failure looks like this in the output
> >
> > IL, IU, VL or VU altered by PDSYEVR
> > 500 1 1 1 8 Y 0.26 -1.00 0.19E-02 15. FAILED
> > 500 1 2 1 8 Y 0.29 -1.00 0.79E-03 3.9 PASSED
> > EVR
> > IL, IU, VL or VU altered by PDSYEVR
> > 500 1 1 2 8 Y 0.52 -1.00 0.82E-03 2.5 FAILED
> > 500 1 2 2 8 Y 0.41 -1.00 0.79E-03 2.3 PASSED
> > EVR
> > 500 2 2 2 8 Y 0.18 -1.00 0.78E-03 3.0 PASSED
> > EVR
> > IL, IU, VL or VU altered by PDSYEVR
> > 500 4 1 4 8 Y 0.09 -1.00 0.95E-03 4.1 FAILED
> > 500 4 4 1 8 Y 0.11 -1.00 0.91E-03 2.8 PASSED
> > EVR
> >
> >
> > The variable OMP_NUM_THREADS=1 to stop the openblas from threading.
> > We see similar problems with intel 2016 compilers, but I believe gcc is a
> > good baseline.
> >
> > Any ideas ? For us this is a real problem in that we do not know if this
> > indicates a network (transport) issue in the intel software stack (libpsm2,
> > hfi1 kernel module) which might affect our production codes or if this is
> > an OpenMPI issue. We have some other problems I might ask about later on
> > this list, but nothing which yields such a nice reproducer and especially
> > these other problems might well be application related.
> >
> > Best Regards
> >
> > Christof
> >
> > --
> > Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
> > 28359 Bremen
> >
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> >
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users

--
Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/