Hello,

On Tue, Nov 22, 2016 at 10:35:57PM +0900, Gilles Gouaillardet wrote:
> Christof,
> 
> out of curiosity, could you try to
> mpirun --mca coll ^tuned ...
> and see if it helps?

No, at least not for the workstation example. I will test on my laptop
(Debian stable) tomorrow.
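
For reference, the workstation invocation was essentially the one from my
earlier mail, with the coll exclusion added (roughly, from memory):

mpirun --mca coll ^tuned -np 4 ./xdsyevr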

Thank you all for your help! This is really strange.

Cheers


Christof

> 
> Cheers,
> 
> Gilles
> 
> 
> On Tue, Nov 22, 2016 at 7:21 PM, Christof Koehler
> <christof.koeh...@bccms.uni-bremen.de> wrote:
> > Hello again,
> >
> > I tried to replicate the situation on the workstation at my desk,
> > running Ubuntu 14.04 (gcc 4.8.4) with the OS-supplied LAPACK and
> > BLAS libraries.
> >
> > With Open MPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and
> > failed." together with the "IL, IU, VL or VU altered by PDSYEVR" message,
> > but reasonable-looking numbers, as described before.
> >
> > With 1.10 I get "136 tests completed and passed residual checks."
> > instead, as observed before.
> >
> > So this is likely not an Omni-Path problem but something else in 2.0.1.
> >
> > I should perhaps clarify that I am using the current revision 206 from
> > the ScaLapack trunk (svn co
> > https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk),
> > but if I remember correctly I had very similar problems with the 2.0.2
> > release tarball.
> >
> > Both MPIs were built with
> > ./configure --with-hwloc=internal --enable-static 
> > --enable-orterun-prefix-by-default
> >
> >
> > Best Regards
> >
> > Christof
> >
> > On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
> >> Hi Christof,
> >>
> >> Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
> >> Could you try to run the tests using the 'ob1' PML in order to
> >> bypass PSM2?
> >>
> >> mpirun --mca pml ob1 (all the rest of the args)
> >>
> >> and see if you still observe the failures?
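> >>
> >> Applied to the tester, that would be e.g. (a sketch based on the
> >> single-node command line from your mail below):
> >>
> >> mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
> >>     -mca oob_tcp_if_include eth0,team0 \
> >>     -host node009,node009,node009,node009 ./xdsyevr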
> >>
> >> Howard
> >>
> >>
> >> 2016-11-18 9:32 GMT-07:00 Christof Köhler <
> >> christof.koeh...@bccms.uni-bremen.de>:
> >>
> >> > Hello everybody,
> >> >
> >> > I am observing failures in the xdsyevr (and xssyevr) ScaLapack self-tests
> >> > when running on one or two nodes with Open MPI 2.0.1. With 1.10.4 no
> >> > failures are observed. Also, with MVAPICH2 2.2 no failures are observed.
> >> > The other testers appear to be working with all the MPIs mentioned (I have
> >> > to triple-check again). I somehow overlooked the failures below at first.
> >> >
> >> > The system is an Intel Omni-Path system (newest Intel driver release
> >> > 10.2), i.e. we are using the PSM2 MTL, I believe.
> >> >
> >> > I built both Open MPI versions with gcc 6.2 and the following identical options:
> >> > ./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> >> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
> >> > --enable-orterun-prefix-by-default
> >> >
> >> > The ScaLapack build also uses gcc 6.2 and OpenBLAS 0.2.19, with "-O1
> >> > -g" as FCFLAGS and CCFLAGS, identical for all tests; only the wrapper
> >> > compiler changes.
> >> >
> >> > With OpenMPI 1.10.4 I see on a single node
> >> >
> >> >  mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> >> > ./xdsyevr
> >> >   136 tests completed and passed residual checks.
> >> >     0 tests completed without checking.
> >> >     0 tests skipped for lack of memory.
> >> >     0 tests completed and failed.
> >> >
> >> > With OpenMPI 1.10.4 I see on two nodes
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> >> > ./xdsyevr
> >> >   136 tests completed and passed residual checks.
> >> >     0 tests completed without checking.
> >> >     0 tests skipped for lack of memory.
> >> >     0 tests completed and failed.
> >> >
> >> > With OpenMPI 2.0.1 I see on a single node
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> >> > ./xdsyevr
> >> >    32 tests completed and passed residual checks.
> >> >     0 tests completed without checking.
> >> >     0 tests skipped for lack of memory.
> >> >   104 tests completed and failed.
> >> >
> >> > With OpenMPI 2.0.1 I see on two nodes
> >> >
> >> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> >> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> >> > ./xdsyevr
> >> >    32 tests completed and passed residual checks.
> >> >     0 tests completed without checking.
> >> >     0 tests skipped for lack of memory.
> >> >   104 tests completed and failed.
> >> >
> >> > A typical failure looks like this in the output:
> >> >
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >    500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.     FAILED
> >> >    500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9     PASSED
> >> >  EVR
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >    500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5     FAILED
> >> >    500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3     PASSED
> >> >  EVR
> >> >    500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0     PASSED
> >> >  EVR
> >> > IL, IU, VL or VU altered by PDSYEVR
> >> >    500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1     FAILED
> >> >    500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8     PASSED
> >> >  EVR
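> >> >
> >> > (For context: this message appears when the tester finds the
> >> > eigenvalue-range arguments changed across the call. A minimal sketch of
> >> > such a check, with hypothetical SAV variable names, not the literal
> >> > tester code:
> >> >
> >> >       VLSAV = VL
> >> >       VUSAV = VU
> >> >       ILSAV = IL
> >> >       IUSAV = IU
> >> >       CALL PDSYEVR( JOBZ, RANGE, N, A, IA, JA, DESCA, VL, VU, IL, IU,
> >> >      $              M, NZ, W, Z, IZ, JZ, DESCZ, WORK, LWORK, IWORK,
> >> >      $              LIWORK, INFO )
> >> >       IF( VL.NE.VLSAV .OR. VU.NE.VUSAV .OR.
> >> >      $    IL.NE.ILSAV .OR. IU.NE.IUSAV )
> >> >      $   WRITE( *, * ) 'IL, IU, VL or VU altered by PDSYEVR'
> >> >
> >> > These are input-only arguments, so they should come back unchanged; if
> >> > they do not, something is overwriting memory during the call.)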
> >> >
> >> >
> >> > The variable OMP_NUM_THREADS=1 is set to stop OpenBLAS from threading.
> >> > We see similar problems with the Intel 2016 compilers, but I believe gcc
> >> > is a good baseline.
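> >> >
> >> > (For completeness, the environment setup before launching is just, e.g.,
> >> > assuming a bash shell:
> >> >
> >> > export OMP_NUM_THREADS=1
> >> >
> >> > which the "-x OMP_NUM_THREADS" in the mpirun lines above then forwards
> >> > to all ranks.)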
> >> >
> >> > Any ideas? For us this is a real problem in that we do not know whether
> >> > this indicates a network (transport) issue in the Intel software stack
> >> > (libpsm2, hfi1 kernel module), which might affect our production codes,
> >> > or whether this is an Open MPI issue. We have some other problems I might
> >> > ask about later on this list, but nothing that yields such a nice
> >> > reproducer, and those other problems might well be application-related.
> >> >
> >> > Best Regards
> >> >
> >> > Christof
> >> >
> >

-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
