Hello everybody,

as promised I started to test on my laptop (which has only two physical
cores, in case that matters).
As I discovered, the story is not as simple as I assumed. I was focusing on
xdsyevr when testing on the workstation and overlooked the others.

On the cluster the only test which throws errors is xdsyevr with 2.0.1; with
1.10.4 everything is fine, which I have double-checked by now. On the
workstation I get "136 tests completed and failed." in xcheevr with 1.10.4,
which I had overlooked. With 2.0.1 I get "136 tests completed and failed" in
xdsyevr and xssyevr. On the laptop I am not sure yet (I ran out of battery
power), but it looked similar to the workstation: failures with both
versions.

So there is certainly a factor unrelated to OpenMPI at play, and it might
even be that these failures are complete noise. I will try to investigate
this further. If some list member has a good idea how to test and what to
look for, I would appreciate a hint. Also, perhaps someone could try to
replicate this.

Thank you for your help so far.
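For reference, applied to my earlier command line the two suggestions from
this thread would look roughly as follows (I have not run these yet; host
and interface arguments are the ones from the runs quoted below, with
OMP_NUM_THREADS=1 exported as before):

# Howard's suggestion: force the ob1 PML to bypass PSM2
mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr

# Gilles' suggestion: exclude the tuned collective component
mpirun --mca coll ^tuned -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
    -mca oob_tcp_if_include eth0,team0 \
    -host node009,node010,node009,node010 ./xdsyevr

I will try both on the cluster as part of the further investigation.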
Best Regards

Christof

On Tue, Nov 22, 2016 at 10:35:57PM +0900, Gilles Gouaillardet wrote:
> Christoph,
>
> out of curiosity, could you try to
> mpirun --mca coll ^tuned ...
> and see if it helps ?
>
> Cheers,
>
> Gilles
>
>
> On Tue, Nov 22, 2016 at 7:21 PM, Christof Koehler
> <christof.koeh...@bccms.uni-bremen.de> wrote:
> > Hello again,
> >
> > I tried to replicate the situation on the workstation at my desk,
> > running ubuntu 14.04 (gcc 4.8.4) and with the OS-supplied lapack and
> > blas libraries.
> >
> > With openmpi 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed
> > and failed." with the "IL, IU, VL or VU altered by PDSYEVR" message,
> > but reasonable-looking numbers as described before.
> >
> > With 1.10 I get "136 tests completed and passed residual checks."
> > instead, as observed before.
> >
> > So this is likely not an Omni-Path problem but something else in 2.0.1.
> >
> > I should perhaps clarify that I am using the current revision 206 from
> > the ScaLAPACK trunk (svn co
> > https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk), but if I
> > remember correctly I had very similar problems with the 2.0.2 release
> > tarball.
> >
> > Both MPIs were built with
> > ./configure --with-hwloc=internal --enable-static
> >             --enable-orterun-prefix-by-default
> >
> > Best Regards
> >
> > Christof
> >
> > On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
> >> Hi Christof,
> >>
> >> Thanks for trying out 2.0.1. Sorry that you're hitting problems.
> >> Could you try to run the tests using the 'ob1' PML in order to
> >> bypass PSM2?
> >>
> >> mpirun --mca pml ob1 (all the rest of the args)
> >>
> >> and see if you still observe the failures?
> >>
> >> Howard
> >>
> >>
> >> 2016-11-18 9:32 GMT-07:00 Christof Köhler <
> >> christof.koeh...@bccms.uni-bremen.de>:
> >>
> >> > Hello everybody,
> >> >
> >> > I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self
> >> > tests when running on one or two nodes with OpenMPI 2.0.1. With
> >> > 1.10.4 no failures are observed. Also, with mvapich2 2.2 no failures
> >> > are observed. The other testers appear to be working with all MPIs
> >> > mentioned (I have to triple-check again). I somehow overlooked the
> >> > failures below at first.
> >> >
> >> > The system is an Intel Omni-Path system (newest Intel driver release
> >> > 10.2), i.e. we are using the PSM2 MTL, I believe.
> >> >
> >> > I built the OpenMPIs with gcc 6.2 and the following identical options:
> >> > ./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> >> >             --with-psm2 --with-tm --with-hwloc=internal
> >> >             --enable-static --enable-orterun-prefix-by-default
> >> >
> >> > The ScaLAPACK build is also with gcc 6.2 and openblas 0.2.19, using
> >> > "-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests; only the
> >> > wrapper compiler changes.
> >> >
> >> > With OpenMPI 1.10.4 I see on a single node
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS
> >> >   -mca oob_tcp_if_include eth0,team0
> >> >   -host node009,node009,node009,node009 ./xdsyevr
> >> > 136 tests completed and passed residual checks.
> >> >   0 tests completed without checking.
> >> >   0 tests skipped for lack of memory.
> >> >   0 tests completed and failed.
> >> >
> >> > With OpenMPI 1.10.4 I see on two nodes
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS
> >> >   -mca oob_tcp_if_include eth0,team0
> >> >   -host node009,node010,node009,node010 ./xdsyevr
> >> > 136 tests completed and passed residual checks.
> >> >   0 tests completed without checking.
> >> >   0 tests skipped for lack of memory.
> >> >   0 tests completed and failed.
> >> >
> >> > With OpenMPI 2.0.1 I see on a single node
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS
> >> >   -mca oob_tcp_if_include eth0,team0
> >> >   -host node009,node009,node009,node009 ./xdsyevr
> >> >  32 tests completed and passed residual checks.
> >> >   0 tests completed without checking.
> >> >   0 tests skipped for lack of memory.
> >> > 104 tests completed and failed.
> >> >
> >> > With OpenMPI 2.0.1 I see on two nodes
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS
> >> >   -mca oob_tcp_if_include eth0,team0
> >> >   -host node009,node010,node009,node010 ./xdsyevr
> >> >  32 tests completed and passed residual checks.
> >> >   0 tests completed without checking.
> >> >   0 tests skipped for lack of memory.
> >> > 104 tests completed and failed.
> >> >
> >> > A typical failure looks like this in the output:
> >> >
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >  500   1   1   1   8  Y  0.26  -1.00  0.19E-02   15.  FAILED
> >> >  500   1   2   1   8  Y  0.29  -1.00  0.79E-03   3.9  PASSED   EVR
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >  500   1   1   2   8  Y  0.52  -1.00  0.82E-03   2.5  FAILED
> >> >  500   1   2   2   8  Y  0.41  -1.00  0.79E-03   2.3  PASSED   EVR
> >> >  500   2   2   2   8  Y  0.18  -1.00  0.78E-03   3.0  PASSED   EVR
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >  500   4   1   4   8  Y  0.09  -1.00  0.95E-03   4.1  FAILED
> >> >  500   4   4   1   8  Y  0.11  -1.00  0.91E-03   2.8  PASSED   EVR
> >> >
> >> > The variable OMP_NUM_THREADS is set to 1 to stop openblas from
> >> > threading. We see similar problems with the Intel 2016 compilers,
> >> > but I believe gcc is a good baseline.
> >> >
> >> > Any ideas? For us this is a real problem in that we do not know
> >> > whether it indicates a network (transport) issue in the Intel
> >> > software stack (libpsm2, hfi1 kernel module) which might affect our
> >> > production codes, or whether it is an OpenMPI issue. We have some
> >> > other problems I might ask about later on this list, but nothing
> >> > which yields such a nice reproducer, and especially these other
> >> > problems might well be application related.
> >> >
> >> > Best Regards
> >> >
> >> > Christof
> >> >
> >> > --
> >> > Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
> >> > Universitaet Bremen/ BCCMS      phone: +49-(0)421-218-62334
> >> > Am Fallturm 1/ TAB/ Raum 3.12   fax:   +49-(0)421-218-62770
> >> > 28359 Bremen
> >> >
> >> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/

--
Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS      phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12   fax:   +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
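P.S. In case somebody wants to try to replicate this independently, the
short version of what I am doing is roughly the following (SLmake.inc of
course has to be adapted to the local MPI wrappers and BLAS first; the svn
URL and revision are the ones mentioned in the quoted mails):

svn co https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk scalapack
cd scalapack             # adjust SLmake.inc, starting from SLmake.inc.example
make lib exe             # builds the library and the test drivers in TESTING/
cd TESTING
export OMP_NUM_THREADS=1 # keep openblas from threading
mpirun -np 4 ./xdsyevr   # xssyevr and xcheevr are the other affected testers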
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users