Hello again,

I tried to replicate the situation on the workstation at my desk,
running Ubuntu 14.04 (gcc 4.8.4) with the OS-supplied LAPACK and
BLAS libraries.

With Open MPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and
failed." with the "IL, IU, VL or VU altered by PDSYEVR" message, but
reasonable-looking numbers, as described before.

With 1.10 I instead get "136 tests completed and passed residual checks.",
as observed before.

So this is likely not an Omni-Path problem but something else in 2.0.1.
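
For comparing the two builds side by side, the result lines of a saved
tester log can be tallied with a few lines of shell. This is only a sketch:
summarize_log and the log file name are made up for illustration, and it
assumes result lines end in PASSED or FAILED (optionally followed by a tag
such as EVR), as in the output quoted further down.

```shell
# Tally per-test results in a ScaLAPACK tester log.
# summarize_log is a hypothetical helper, not part of ScaLAPACK or Open MPI.
summarize_log() {
    log="$1"
    # Result lines end in PASSED or FAILED, optionally followed by a
    # routine tag. The lowercase summary lines ("... tests completed and
    # failed.") do not match because grep is case-sensitive here.
    passed=$(grep -c -E 'PASSED( +[A-Z]+)? *$' "$log")
    failed=$(grep -c -E 'FAILED( +[A-Z]+)? *$' "$log")
    # Count the "IL, IU, VL or VU altered by ..." diagnostics separately.
    altered=$(grep -c 'altered by' "$log")
    echo "passed=$passed failed=$failed altered=$altered"
}

# Example use (assumed file name):
#   mpirun -np 4 ./xdsyevr > xdsyevr.log 2>&1
#   summarize_log xdsyevr.log
```

If the "altered" count matches the "failed" count, the failures all come
from the argument check rather than from the residual checks.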

I should perhaps clarify that I am using the current revision 206 from
the ScaLAPACK trunk (svn co
https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk),
but if I remember correctly I had very similar problems with the 2.0.2
release tarball.

Both MPIs were built with
./configure --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default


Best Regards

Christof

On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
> Hi Christof,
> 
> Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
> Could you try to run the tests using the 'ob1' PML in order to
> bypass PSM2?
> 
> mpirun --mca pml ob1 (all the rest of the args)
> 
> and see if you still observe the failures?
> 
> Howard
> 
> 
> 2016-11-18 9:32 GMT-07:00 Christof Köhler <
> christof.koeh...@bccms.uni-bremen.de>:
> 
> > Hello everybody,
> >
> > I am observing failures in the xdsyevr (and xssyevr) ScaLapack self tests
> > when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
> > failures are observed. Also, with mvapich2 2.2 no failures are observed.
> > The other testers appear to be working with all MPIs mentioned (have to
> > triple check again). I somehow overlooked the failures below at first.
> >
> > The system is an Intel OmniPath system (newest Intel driver release 10.2),
> > i.e. we are using the PSM2 mtl I believe.
> >
> > I built the OpenMPIs with gcc 6.2 and the following identical options:
> > ./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
> > --enable-orterun-prefix-by-default
> >
> > The ScaLapack build is also with gcc 6.2, openblas 0.2.19 and using "-O1
> > -g" as FCFLAGS and CCFLAGS identical for all tests, only wrapper compiler
> > changes.
> >
> > With OpenMPI 1.10.4 I see on a single node
> >
> >  mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> > ./xdsyevr
> > 136 tests completed and passed residual checks.
> >     0 tests completed without checking.
> >     0 tests skipped for lack of memory.
> >     0 tests completed and failed.
> >
> > With OpenMPI 1.10.4 I see on two nodes
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> > ./xdsyevr
> >   136 tests completed and passed residual checks.
> >     0 tests completed without checking.
> >     0 tests skipped for lack of memory.
> >     0 tests completed and failed.
> >
> > With OpenMPI 2.0.1 I see on a single node
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> > ./xdsyevr
> > 32 tests completed and passed residual checks.
> >     0 tests completed without checking.
> >     0 tests skipped for lack of memory.
> >   104 tests completed and failed.
> >
> > With OpenMPI 2.0.1 I see on two nodes
> >
> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> > ./xdsyevr
> >    32 tests completed and passed residual checks.
> >     0 tests completed without checking.
> >     0 tests skipped for lack of memory.
> >   104 tests completed and failed.
> >
> > A typical failure looks like this in the output
> >
> > IL, IU, VL or VU altered by PDSYEVR
> >    500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.     FAILED
> >    500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9     PASSED  EVR
> > IL, IU, VL or VU altered by PDSYEVR
> >    500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5     FAILED
> >    500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3     PASSED  EVR
> >    500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0     PASSED  EVR
> > IL, IU, VL or VU altered by PDSYEVR
> >    500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1     FAILED
> >    500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8     PASSED  EVR
> >
> >
> > The variable OMP_NUM_THREADS=1 is set to stop OpenBLAS from threading.
> > We see similar problems with intel 2016 compilers, but I believe gcc is a
> > good baseline.
> >
> > Any ideas? For us this is a real problem in that we do not know if this
> > indicates a network (transport) issue in the intel software stack (libpsm2,
> > hfi1 kernel module) which might affect our production codes or if this is
> > an OpenMPI issue. We have some other problems I might ask about later on
> > this list, but nothing which yields such a nice reproducer and especially
> > these other problems might well be application related.
> >
> > Best Regards
> >
> > Christof
> >
> > --
> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > 28359 Bremen
> >
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> >
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users

-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
