Hello,

On Tue, Nov 22, 2016 at 10:35:57PM +0900, Gilles Gouaillardet wrote:
> Christoph,
>
> out of curiosity, could you try to
> mpirun --mca coll ^tuned ...
> and see if it helps ?
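(For reference, combined with the single-node reproducer quoted further down in this thread, that suggestion corresponds to an invocation roughly like the one below; the trailing arguments are only an assumption based on the earlier runs and are shown for illustration.)

mpirun --mca coll ^tuned -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr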
No, at least not for the workstation example. I will test with my laptop
(Debian stable) tomorrow. Thank you all for your help! This is really
strange.

Cheers

Christof

>
> Cheers,
>
> Gilles
>
>
> On Tue, Nov 22, 2016 at 7:21 PM, Christof Koehler
> <christof.koeh...@bccms.uni-bremen.de> wrote:
> > Hello again,
> >
> > I tried to replicate the situation on the workstation at my desk,
> > running Ubuntu 14.04 (gcc 4.8.4) and with the OS-supplied LAPACK and
> > BLAS libraries.
> >
> > With OpenMPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and
> > failed." with the "IL, IU, VL or VU altered by PDSYEVR" message but
> > reasonable-looking numbers, as described before.
> >
> > With 1.10 I get "136 tests completed and passed residual checks."
> > instead, as observed before.
> >
> > So this is likely not an Omni-Path problem but something else in 2.0.1.
> >
> > I should also clarify that I am using the current revision 206 from
> > the ScaLAPACK trunk (svn co
> > https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk),
> > but if I remember correctly I had very similar problems with the 2.0.2
> > release tarball.
> >
> > Both MPIs were built with
> > ./configure --with-hwloc=internal --enable-static
> > --enable-orterun-prefix-by-default
> >
> > Best Regards
> >
> > Christof
> >
> > On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
> >> Hi Christof,
> >>
> >> Thanks for trying out 2.0.1. Sorry that you're hitting problems.
> >> Could you try to run the tests using the 'ob1' PML in order to
> >> bypass PSM2?
> >>
> >> mpirun --mca pml ob1 (all the rest of the args)
> >>
> >> and see if you still observe the failures?
> >>
> >> Howard
> >>
> >>
> >> 2016-11-18 9:32 GMT-07:00 Christof Köhler
> >> <christof.koeh...@bccms.uni-bremen.de>:
> >>
> >> > Hello everybody,
> >> >
> >> > I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self tests
> >> > when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
> >> > failures are observed. Also, with MVAPICH2 2.2 no failures are observed.
> >> > The other testers appear to be working with all MPIs mentioned (I have to
> >> > triple-check again). I somehow overlooked the failures below at first.
> >> >
> >> > The system is an Intel Omni-Path system (newest Intel driver release 10.2),
> >> > i.e. we are using the PSM2 MTL, I believe.
> >> >
> >> > I built the OpenMPIs with gcc 6.2 and the following identical options:
> >> > ./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> >> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
> >> > --enable-orterun-prefix-by-default
> >> >
> >> > The ScaLAPACK build is also with gcc 6.2 and OpenBLAS 0.2.19, using
> >> > "-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests; only the
> >> > wrapper compiler changes.
> >> >
> >> > With OpenMPI 1.10.4 I see on a single node
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> >> > ./xdsyevr
> >> > 136 tests completed and passed residual checks.
> >> > 0 tests completed without checking.
> >> > 0 tests skipped for lack of memory.
> >> > 0 tests completed and failed.
> >> >
> >> > With OpenMPI 1.10.4 I see on two nodes
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> >> > ./xdsyevr
> >> > 136 tests completed and passed residual checks.
> >> > 0 tests completed without checking.
> >> > 0 tests skipped for lack of memory.
> >> > 0 tests completed and failed.
> >> >
> >> > With OpenMPI 2.0.1 I see on a single node
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> >> > ./xdsyevr
> >> > 32 tests completed and passed residual checks.
> >> > 0 tests completed without checking.
> >> > 0 tests skipped for lack of memory.
> >> > 104 tests completed and failed.
> >> >
> >> > With OpenMPI 2.0.1 I see on two nodes
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> >> > ./xdsyevr
> >> > 32 tests completed and passed residual checks.
> >> > 0 tests completed without checking.
> >> > 0 tests skipped for lack of memory.
> >> > 104 tests completed and failed.
> >> >
> >> > A typical failure looks like this in the output:
> >> >
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >   500   1   1   1   8  Y   0.26  -1.00  0.19E-02   15.  FAILED
> >> >   500   1   2   1   8  Y   0.29  -1.00  0.79E-03   3.9  PASSED  EVR
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >   500   1   1   2   8  Y   0.52  -1.00  0.82E-03   2.5  FAILED
> >> >   500   1   2   2   8  Y   0.41  -1.00  0.79E-03   2.3  PASSED  EVR
> >> >   500   2   2   2   8  Y   0.18  -1.00  0.78E-03   3.0  PASSED  EVR
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >   500   4   1   4   8  Y   0.09  -1.00  0.95E-03   4.1  FAILED
> >> >   500   4   4   1   8  Y   0.11  -1.00  0.91E-03   2.8  PASSED  EVR
> >> >
> >> > The variable OMP_NUM_THREADS is set to 1 to stop OpenBLAS from threading.
> >> > We see similar problems with the Intel 2016 compilers, but I believe gcc
> >> > is a good baseline.
> >> >
> >> > Any ideas? For us this is a real problem in that we do not know if this
> >> > indicates a network (transport) issue in the Intel software stack (libpsm2,
> >> > hfi1 kernel module) which might affect our production codes, or if this is
> >> > an OpenMPI issue. We have some other problems I might ask about later on
> >> > this list, but nothing which yields such a nice reproducer, and especially
> >> > these other problems might well be application related.
> >> >
> >> > Best Regards
> >> >
> >> > Christof
> >> >
> >> > --
> >> > Dr. rer. nat. Christof Köhler    email: c.koeh...@bccms.uni-bremen.de
> >> > Universitaet Bremen/ BCCMS       phone: +49-(0)421-218-62334
> >> > Am Fallturm 1/ TAB/ Raum 3.12    fax:   +49-(0)421-218-62770
> >> > 28359 Bremen
> >> >
> >> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> >> >
> >> > _______________________________________________
> >> > users mailing list
> >> > users@lists.open-mpi.org
> >> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >
> > --
> > Dr. rer. nat. Christof Köhler    email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS       phone: +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12    fax:   +49-(0)421-218-62770
> > 28359 Bremen
> >
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> >
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users

--
Dr. rer. nat. Christof Köhler    email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS       phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12    fax:   +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
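(Likewise, Howard's earlier suggestion to bypass PSM2 via the ob1 PML, applied to the same single-node reproducer, would look roughly like the following; the trailing arguments are again an assumption based on the runs quoted above and are shown for illustration only.)

mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009 ./xdsyevr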