Christof,

out of curiosity, could you try

    mpirun --mca coll ^tuned ...

and see if it helps?
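
The ^ excludes the "tuned" collective component, so Open MPI falls back
to its other coll components (e.g. basic). With the workstation run from
your mail, that would be for example

    mpirun --mca coll ^tuned -np 4 xdsyevr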

Cheers,

Gilles


On Tue, Nov 22, 2016 at 7:21 PM, Christof Koehler
<christof.koeh...@bccms.uni-bremen.de> wrote:
> Hello again,
>
> I tried to replicate the situation on the workstation at my desk,
> running Ubuntu 14.04 (gcc 4.8.4) with the OS-supplied LAPACK and
> BLAS libraries.
>
> With OpenMPI 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and
> failed." together with the "IL, IU, VL or VU altered by PDSYEVR" message,
> but reasonable-looking numbers as described before.
>
> With 1.10 I get "136 tests completed and passed residual checks."
> instead, as observed before.
>
> So this is likely not an Omni-Path problem but something else in 2.0.1.
>
> I should perhaps clarify that I am using the current revision 206 from
> the ScaLAPACK trunk (svn co
> https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk),
> but if I remember correctly I had very similar problems with the 2.0.2
> release tarball.
>
> Both MPIs were built with
> ./configure --with-hwloc=internal --enable-static 
> --enable-orterun-prefix-by-default
>
>
> Best Regards
>
> Christof
>
> On Fri, Nov 18, 2016 at 11:25:06AM -0700, Howard Pritchard wrote:
>> Hi Christof,
>>
>> Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
>> Could you try to run the tests using the 'ob1' PML in order to
>> bypass PSM2?
>>
>> mpirun --mca pml ob1 (all the rest of the args)
>>
>> and see if you still observe the failures?
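>>
>> For example, with your single-node invocation that would be
>>
>> mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
>>     -mca oob_tcp_if_include eth0,team0 \
>>     -host node009,node009,node009,node009 ./xdsyevr
>>
>> With ob1 the traffic should go over the BTLs (e.g. vader/tcp) instead of
>> the PSM2 MTL, I believe.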
>>
>> Howard
>>
>>
>> 2016-11-18 9:32 GMT-07:00 Christof Köhler <christof.koeh...@bccms.uni-bremen.de>:
>>
>> > Hello everybody,
>> >
>> > I am observing failures in the xdsyevr (and xssyevr) ScaLAPACK self-tests
>> > when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
>> > failures are observed, nor with MVAPICH2 2.2. The other testers appear
>> > to be working with all MPIs mentioned (I have to triple-check again).
>> > I somehow overlooked the failures below at first.
>> >
>> > The system is an Intel Omni-Path system (newest Intel driver release 10.2),
>> > i.e. we are using the PSM2 MTL, I believe.
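>> > (I have not checked the selection explicitly; if needed, running with
>> > --mca pml_base_verbose 10 should log which PML/MTL gets picked at
>> > startup, and ompi_info | grep psm2 shows whether the PSM2 components
>> > were built at all.)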
>> >
>> > I built the OpenMPIs with gcc 6.2 and the following identical options:
>> > ./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
>> > --with-psm2 --with-tm --with-hwloc=internal --enable-static
>> > --enable-orterun-prefix-by-default
>> >
>> > The ScaLAPACK build is also with gcc 6.2 and OpenBLAS 0.2.19, using "-O1
>> > -g" as FCFLAGS and CCFLAGS, identical for all tests; only the wrapper
>> > compiler changes.
>> >
>> > With OpenMPI 1.10.4 I see on a single node
>> >
>> >  mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
>> > ./xdsyevr
>> >   136 tests completed and passed residual checks.
>> >     0 tests completed without checking.
>> >     0 tests skipped for lack of memory.
>> >     0 tests completed and failed.
>> >
>> > With OpenMPI 1.10.4 I see on two nodes
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
>> > ./xdsyevr
>> >   136 tests completed and passed residual checks.
>> >     0 tests completed without checking.
>> >     0 tests skipped for lack of memory.
>> >     0 tests completed and failed.
>> >
>> > With OpenMPI 2.0.1 I see on a single node
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
>> > ./xdsyevr
>> >    32 tests completed and passed residual checks.
>> >     0 tests completed without checking.
>> >     0 tests skipped for lack of memory.
>> >   104 tests completed and failed.
>> >
>> > With OpenMPI 2.0.1 I see on two nodes
>> >
>> > mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
>> > oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
>> > ./xdsyevr
>> >    32 tests completed and passed residual checks.
>> >     0 tests completed without checking.
>> >     0 tests skipped for lack of memory.
>> >   104 tests completed and failed.
>> >
>> > A typical failure looks like this in the output:
>> >
>> > IL, IU, VL or VU altered by PDSYEVR
>> >    500   1   1   1   8   Y     0.26    -1.00  0.19E-02   15.     FAILED
>> >    500   1   2   1   8   Y     0.29    -1.00  0.79E-03   3.9     PASSED
>> >  EVR
>> > IL, IU, VL or VU altered by PDSYEVR
>> >    500   1   1   2   8   Y     0.52    -1.00  0.82E-03   2.5     FAILED
>> >    500   1   2   2   8   Y     0.41    -1.00  0.79E-03   2.3     PASSED
>> >  EVR
>> >    500   2   2   2   8   Y     0.18    -1.00  0.78E-03   3.0     PASSED
>> >  EVR
>> > IL, IU, VL or VU altered by PDSYEVR
>> >    500   4   1   4   8   Y     0.09    -1.00  0.95E-03   4.1     FAILED
>> >    500   4   4   1   8   Y     0.11    -1.00  0.91E-03   2.8     PASSED
>> >  EVR
>> >
>> >
>> > The variable OMP_NUM_THREADS is set to 1 to stop OpenBLAS from
>> > threading. We see similar problems with the Intel 2016 compilers, but I
>> > believe gcc is a good baseline.
>> >
>> > Any ideas? For us this is a real problem: we do not know whether it
>> > indicates a network (transport) issue in the Intel software stack
>> > (libpsm2, hfi1 kernel module), which might affect our production codes,
>> > or an OpenMPI issue. We have some other problems I might ask about later
>> > on this list, but nothing that yields such a nice reproducer, and those
>> > other problems might well be application related.
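>> >
>> > If it helps to narrow things down, I could also try a bare-bones
>> > collective test outside of ScaLAPACK. A minimal sketch (hypothetical,
>> > not derived from the tester; the "altered by PDSYEVR" message is only
>> > a hint that a collective might be misbehaving) would be:
>> >
>> > /* allreduce_inplace.c -- minimal in-place all-reduce check.
>> >  * Build: mpicc -O1 -o allreduce_inplace allreduce_inplace.c
>> >  * Run:   mpirun -np 4 ./allreduce_inplace                  */
>> > #include <mpi.h>
>> > #include <stdio.h>
>> >
>> > int main(int argc, char **argv)
>> > {
>> >     int rank, size;
>> >     double buf[4];
>> >
>> >     MPI_Init(&argc, &argv);
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
>> >
>> >     for (int i = 0; i < 4; i++)
>> >         buf[i] = (double)(rank + i);
>> >
>> >     /* Every rank contributes its buffer in place and should receive
>> >      * the elementwise sum over all ranks. */
>> >     MPI_Allreduce(MPI_IN_PLACE, buf, 4, MPI_DOUBLE, MPI_SUM,
>> >                   MPI_COMM_WORLD);
>> >
>> >     /* Expected: buf[i] == sum over ranks r of (r + i)
>> >      *                  == size*(size-1)/2 + size*i */
>> >     for (int i = 0; i < 4; i++) {
>> >         double expect = (double)(size * (size - 1) / 2 + size * i);
>> >         if (buf[i] != expect)
>> >             printf("rank %d: buf[%d] = %g, expected %g <-- MISMATCH\n",
>> >                    rank, i, buf[i], expect);
>> >     }
>> >     MPI_Finalize();
>> >     return 0;
>> > }
>> >
>> > Running that with different pml and coll MCA settings should at least
>> > separate a transport problem from a problem in a collective component.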
>> >
>> > Best Regards
>> >
>> > Christof
>> >
>> > --
>> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
>> > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
>> > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
>> > 28359 Bremen
>> >
>> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>> >
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>