Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello everybody,

I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch

commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres 
Date:   Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals

Unfortunately it changes nothing. The root rank stops, and all other
ranks (and mpirun) just stay there, the remaining ranks at 100 % CPU,
apparently waiting in that allreduce. The stack trace looks a bit more
interesting (is a git checkout always a debug build?), so I include it at the very
bottom just in case.

Off-list Gilles Gouaillardet suggested setting breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful? I need a
moment to figure out how to do this, but I can definitely try.
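In case it is useful, this is roughly what I plan to do (a sketch only;
the PIDs are placeholders for the vasp ranks and mpirun):

  gdb -p <pid>
  (gdb) catch signal
  (gdb) break exit
  (gdb) break _exit
  (gdb) continue

i.e. attach to the already running processes, stop on any delivered signal
and on the exit paths, and then see who receives what when the root rank
stops.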

One more remark: during "make install" from the git repo I see a

WARNING!  Common symbols found:
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2complex
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2double_complex
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2double_precision
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2integer
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2real
  mpi-f08-types.o: 0004 C ompi_f08_mpi_aint
  mpi-f08-types.o: 0004 C ompi_f08_mpi_band
  mpi-f08-types.o: 0004 C ompi_f08_mpi_bor
  mpi-f08-types.o: 0004 C ompi_f08_mpi_bxor
  mpi-f08-types.o: 0004 C ompi_f08_mpi_byte

I have never noticed this before.


Best Regards

Christof

Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
#0  0x2af84e4c669d in poll () from /lib64/libc.so.6
#1  0x2af850517496 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x2af85050ffa5 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x2af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4  0x2af84e02f7f7 in ompi_request_default_wait_all (count=233618144, 
requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5  0x2af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling 
(sbuf=0xdecbae0,
rbuf=0x2, count=0, dtype=0x, op=0x0, comm=0x1, 
module=0xdee69e0) at base/coll_base_allreduce.c:225
#6  0x2af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed 
(sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0x, op=0x0, comm=0x1, 
module=0x1) at coll_tuned_decision_fixed.c:66
#7  0x2af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, 
count=0, datatype=0x, op=0x0, comm=0x1) at pallreduce.c:107
#8  0x2af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", 
recvbuf=0x2 , count=0x0, 
datatype=0x, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at 
pallreduce_f.c:87
#9  0x0045ecc6 in m_sum_i_ ()
#10 0x00e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x004325ff in vamp () at main.F:2640
#12 0x0040de1e in main ()
#13 0x2af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0040dd29 in _start ()

On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> Hi Christof
> 
> Sorry if I missed this, but it sounds like you are saying that one of your 
> procs abnormally terminates, and we are failing to kill the remaining job? Is 
> that correct?
> 
> If so, I just did some work that might relate to that problem that is pending 
> in PR #2528: https://github.com/open-mpi/ompi/pull/2528 
> 
> 
> Would you be able to try that?
> 
> Ralph
> 
> > On Dec 7, 2016, at 9:37 AM, Christof Koehler 
> >  wrote:
> > 
> > Hello,
> > 
> > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler 
> >>>  wrote:
>  
> >>> I really think the hang is a consequence of
> >>> unclean termination (in the sense that the non-root ranks are not
> >>> terminated) and probably not the cause, in my interpretation of what I
> >>> see. Would you have any suggestion to catch signals sent between orterun
> >>> (mpirun) and the child tasks ?
> >> 
> >> Do you know where in the code the termination call is?  Is it actually 
> >> calling mpi_abort(), or just doing something ugly like calling fortran 
> >> “stop”?  If the latter, would that explain a possible hang?
> > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90
> > input contains an error: a restart is requested, but the wannier90.chk file
> > with the restart information is missing.
> > "
> > Exiting...
> > Error: restart requested but wannier90.chk file not found
> > "
> > So it must terminate.
> > 
> > The termination happens in the libwannier.a, source file io.F90:
> > 
> > write(stdout,*)  'Exiting...'
> > write(stdout, '(1x,a)') trim(error_msg)
> > close(stdout)
> > stop

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Gilles Gouaillardet
Christof,


There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

in Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you are using.
pmap <pid> will show you which lib is used.
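for example (the pid is a placeholder for one of the hanging ranks)

  pmap -p <pid> | grep -E 'libmpi|libopen-pal'

should tell you whether the 2.0.2 install under /cluster/mpi/openmpi is really
the one being used.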

btw, this was not started with
mpirun --mca coll ^tuned ...
right ?

just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of
a feature.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun
never completes.
did i get it right ?

Cheers,

Gilles

On Thursday, December 8, 2016, Christof Koehler <
christof.koeh...@bccms.uni-bremen.de> wrote:

> Hello everybody,
>
> I tried it with the nightly and the direct 2.0.2 branch from git which
> according to the log should contain that patch
>
> commit d0b97d7a408b87425ca53523de369da405358ba2
> Merge: ac8c019 b9420bb
> Author: Jeff Squyres >
> Date:   Wed Dec 7 18:24:46 2016 -0500
> Merge pull request #2528 from rhc54/cmr20x/signals
>
> Unfortunately it changes nothing. The root rank stops and all other
> ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
> apparently in that allreduce. The stack trace looks a bit more
> interesting (git is always debug build ?), so I include it at the very
> bottom just in case.
>
> Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
> __exit etc. to try to catch signals. Would that be useful ? I need a
> moment to figure out how to do this, but I can definitively try.
>
> Some remark: During "make install" from the git repo I see a
>
> WARNING!  Common symbols found:
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2complex
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2double_complex
>   mpi-f08-types.o: 0004 C
> ompi_f08_mpi_2double_precision
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2integer
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2real
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_aint
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_band
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_bor
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_bxor
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_byte
>
> I have never noticed this before.
>
>
> Best Regards
>
> Christof
>
> Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
> #0  0x2af84e4c669d in poll () from /lib64/libc.so.6
> #1  0x2af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/
> intel2016/lib/libopen-pal.so.20
> #2  0x2af85050ffa5 in opal_libevent2022_event_base_loop () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x2af85049fa1f in opal_progress () at runtime/opal_progress.c:207
> #4  0x2af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
> requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
> #5  0x2af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling
> (sbuf=0xdecbae0,
> rbuf=0x2, count=0, dtype=0x, op=0x0, comm=0x1,
> module=0xdee69e0) at base/coll_base_allreduce.c:225
> #6  0x2af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
> (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0x, op=0x0,
> comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
> #7  0x2af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
> count=0, datatype=0x, op=0x0, comm=0x1) at pallreduce.c:107
> #8  0x2af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
> recvbuf=0x2 , count=0x0,
> datatype=0x, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at
> pallreduce_f.c:87
> #9  0x0045ecc6 in m_sum_i_ ()
> #10 0x00e172c9 in mlwf_mp_mlwf_wannier90_ ()
> #11 0x004325ff in vamp () at main.F:2640
> #12 0x0040de1e in main ()
> #13 0x2af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
> #14 0x0040dd29 in _start ()
>
> On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org 
> wrote:
> > Hi Christof
> >
> > Sorry if I missed this, but it sounds like you are saying that one of
> your procs abnormally terminates, and we are failing to kill the remaining
> job? Is that correct?
> >
> > If so, I just did some work that might relate to that problem that is
> pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528 <
> https://github.com/open-mpi/ompi/pull/2528>
> >
> > Would you be able to try that?
> >
> > Ralph
> >
> > > On Dec 7, 2016, at 9:37 AM, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de > wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de > wrote:
> > 
> > >>> I really think the hang is a consequence of
> > >>> uncle

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello,

On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> Christof,
> 
> 
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
Yes, I assumed it was interesting :-) Note that the program is compiled
with -O2 -fp-model source, so optimization is on. I can try with -O0
or with gcc/gfortran (it will take a moment) to make sure the problem
does not come from that.

> 
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not
> using the library you think you use
> pmap  will show you which lib is used
The pmap of the survivor is at the very end of this mail.

> 
> btw, this was not started with
> mpirun --mca coll ^tuned ...
> right ?
This is correct, it was not started with "mpirun --mca coll ^tuned". Using it
does not change anything.

> 
> just to make it clear ...
> a task from your program bluntly issues a fortran STOP, and this is kind of
> a feature.
Yes. The library where the STOP occurs is/was written for serial use as
far as I can tell. As I mentioned, it is not our code but this one,
http://www.wannier.org/ (version 1.2), linked into https://www.vasp.at/, which
should be a working combination.

> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> never completes.
> did i get it right ?
Yes! So it is not a really big problem IMO, just a bit nasty if this
happens with a job in the queueing system.

Best Regards

Christof

Note: git branch 2.0.2 of openmpi was configured and installed (make
install) with
./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
--with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
--prefix=/cluster/mpi/openmpi/2.0.2/intel2016

The OS is CentOS 7, relatively current :-), with the current Omni-Path driver
package from Intel (10.2).

VASP is linked against Intel MKL LAPACK/BLAS, self-compiled ScaLAPACK
(trunk 206) and FFTW 3.3.5; FFTW and ScaLAPACK are statically linked, and of
course libwannier.a version 1.2 is statically linked as well.

pmap -p of the survivor

32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
0040  65200K r-x-- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
045ab000100K r 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
045c4000   2244K rw--- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
047f5000 100900K rw---   [ anon ]
0bfaa000684K rw---   [ anon ]
0c055000 20K rw---   [ anon ]
0c05a000424K rw---   [ anon ]
0c0c4000 68K rw---   [ anon ]
0c0d5000  25384K rw---   [ anon ]
2b17e34f6000132K r-x-- /usr/lib64/ld-2.17.so
2b17e3517000  4K rw---   [ anon ]
2b17e3518000 28K rw-s- /dev/infiniband/uverbs0
2b17e3523000 88K rw---   [ anon ]
2b17e3539000772K rw-s- /dev/infiniband/uverbs0
2b17e35fa000772K rw-s- /dev/infiniband/uverbs0
2b17e36bb000196K rw-s- /dev/infiniband/uverbs0
2b17e36ec000 28K rw-s- /dev/infiniband/uverbs0
2b17e36f3000 20K rw-s- /dev/infiniband/uverbs0
2b17e3717000  4K r /usr/lib64/ld-2.17.so
2b17e3718000  4K rw--- /usr/lib64/ld-2.17.so
2b17e3719000  4K rw---   [ anon ]
2b17e371a000 88K r-x-- /usr/lib64/libpthread-2.17.so
2b17e373   2048K - /usr/lib64/libpthread-2.17.so
2b17e393  4K r /usr/lib64/libpthread-2.17.so
2b17e3931000  4K rw--- /usr/lib64/libpthread-2.17.so
2b17e3932000 16K rw---   [ anon ]
2b17e3936000   1028K r-x-- /usr/lib64/libm-2.17.so
2b17e3a37000   2044K - /usr/lib64/libm-2.17.so
2b17e3c36000  4K r /usr/lib64/libm-2.17.so
2b17e3c37000  4K rw--- /usr/lib64/libm-2.17.so
2b17e3c38000 12K r-x-- /usr/lib64/libdl-2.17.so
2b17e3c3b000   2044K - /usr/lib64/libdl-2.17.so
2b17e3e3a000  4K r /usr/lib64/libdl-2.17.so
2b17e3e3b000  4K rw--- /usr/lib64/libdl-2.17.so
2b17e3e3c000184K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e3e6a000   2044K - 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e4069000  4K r 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e406a000  4K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e406b000 36K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
2b17e4074000   2044K - 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
2b17e4273000  4K r 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
2b17e4274000  4K rw--- 
/cluster/mpi/openmpi/2.0.2/

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello  again,

I am still not sure about breakpoints, but I did a "catch signal" in
gdb; gdbs were attached to the two vasp processes and to mpirun.

When the root rank exits I see in the gdb attached to it
[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior 1 (process 2455) exited normally]

In the gdb attached to the mpirun
Catchpoint 1 (signal SIGCHLD), 0x2b16560f769d in poll () from
/lib64/libc.so.6

In the gdb attached to the second rank I see no output.

Issuing "continue" in the gdb session attached to mpi run does not lead
to anything new as far as I can tell.

The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
#0  0x2b16560f769d in poll () from /lib64/libc.so.6
#1  0x2b1654b3a496 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x2b1654b32fa5 in opal_libevent2022_event_base_loop () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x00406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at
orterun.c:1071
#4  0x004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at
main.c:13

So there is a signal, and mpirun does nothing with it?

Cheers

Christof


On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
> Hello,
> 
> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> > Christof,
> > 
> > 
> > There is something really odd with this stack trace.
> > count is zero, and some pointers do not point to valid addresses (!)
> Yes, I assumed it was interesting :-) Note that the program is compiled
> with   -O2 -fp-model source, so optimization is on. I can try with -O0
> or the gcc/gfortran ( will take a moment) to make sure it is not a
> problem from that.
> 
> > 
> > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> > the stack has been corrupted inside MPI_Allreduce(), or that you are not
> > using the library you think you use
> > pmap  will show you which lib is used
> The pmap of the survivor is at the very end of this mail.
> 
> > 
> > btw, this was not started with
> > mpirun --mca coll ^tuned ...
> > right ?
> This is correct, not started with "mpirun --mca coll ^tuned". Using it
> does not change something.
> 
> > 
> > just to make it clear ...
> > a task from your program bluntly issues a fortran STOP, and this is kind of
> > a feature.
> Yes. The library where the stack occurs is/was written for serial use as
> far as I can tell. As I mentioned, it is not our code but this one
> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which 
> should
> be a working combination.
> 
> > the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> > never completes.
> > did i get it right ?
> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
> would happen with a job in the queueing system.
> 
> Best Regards
> 
> Christof
> 
> Note: git branch 2.0.2 of openmpi was configured and installed (make
> install) with
> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
> 
> The OS is Centos 7, relatively current :-) with current Omni-Path driver
> package from Intel (10.2).
> 
> vasp is linked againts Intel MKL Lapack/Blas, self compiled scalapack
> (trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
> course the libwannier.a version 1.2 statically linked.
> 
> pmap -p of the survivor
> 
> 32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 0040  65200K r-x-- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 045ab000100K r 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 045c4000   2244K rw--- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 047f5000 100900K rw---   [ anon ]
> 0bfaa000684K rw---   [ anon ]
> 0c055000 20K rw---   [ anon ]
> 0c05a000424K rw---   [ anon ]
> 0c0c4000 68K rw---   [ anon ]
> 0c0d5000  25384K rw---   [ anon ]
> 2b17e34f6000132K r-x-- /usr/lib64/ld-2.17.so
> 2b17e3517000  4K rw---   [ anon ]
> 2b17e3518000 28K rw-s- /dev/infiniband/uverbs0
> 2b17e3523000 88K rw---   [ anon ]
> 2b17e3539000772K rw-s- /dev/infiniband/uverbs0
> 2b17e35fa000772K rw-s- /dev/infiniband/uverbs0
> 2b17e36bb000196K rw-s- /dev/infiniband/uverbs0
> 2b17e36ec000 28K rw-s- /dev/infiniband/uverbs0
> 2b17e36f3000 20K rw-s- /dev/infiniband/uverbs0
> 2b17e3717000  4K r /usr/lib64/ld-2.17.so
> 2b17e3718000  4K rw--- /usr/lib64/ld-2.17.so
> 2b17e3719000  4K rw---   [ anon ]
> 2b17e371a000 88K r-x-- /usr/lib64/libpthrea

[OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Daniele Tartarini
Hi,

I've installed on a Red Hat 7.2 system the Open MPI distributed via yum:

openmpi-devel.x86_64 1.10.3-3.el7

For any code I try to run (including the mpitests-*) I get the following
message with slight variants:

my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
failed to appear after 15.0 seconds: Connection timed out

Is anyone able to help me in identifying the source of the problem?
Anyway, /dev/hfi1_0 doesn't exist.

If I use an OpenMPI version compiled from source I have no issue (gcc
4.8.5).

many thanks in advance.

cheers
Daniele

Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-12-08 Thread Dave Love
Jeff Hammond  writes:

>>
>>
>> > Note that MPI implementations may be interested in taking advantage of
>> > https://software.intel.com/en-us/blogs/2016/10/06/intel-
>> xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
>>
>> Is that really useful if it's KNL-specific and MSR-based, with a setup
>> that implementations couldn't assume?
>>
>>
> Why wouldn't it be useful in the context of a parallel runtime system like
> MPI?  MPI implementations take advantage of all sorts of stuff that needs
> to be queried with configuration, during compilation or at runtime.

I probably should have said "useful in practice".  The difference from
other things I can think of is that access to MSRs is privileged, and
it's not clear to me what the implications are of changing it or to what
extent you can assume people will.

> TSX requires that one check the CPUID bits for it, and plenty of folks are
> happily using MSRs (e.g.
> http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).

Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.


[OMPI users] MPI+OpenMP core binding redux

2016-12-08 Thread Dave Love
I think there was a suggestion that the SC16 material would explain how
to get appropriate core binding for MPI+OpenMP (i.e. OMP_NUM_THREADS
cores/process), but it doesn't as far as I can see.

Could someone please say how you're supposed to do that in recent
versions (without relying on bound DRM slots), and provide a working
example in the documentation?  It seems a fairly important case that
should be clear.  Thanks.
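For what it's worth, the sort of invocation I have been experimenting with (a
sketch only, and exactly the kind of thing I would like to see confirmed and
documented) is the PE modifier to --map-by, e.g.

  export OMP_NUM_THREADS=4
  mpirun -np 8 --map-by slot:PE=4 --bind-to core -x OMP_NUM_THREADS ./a.out

which is supposed to give each rank 4 cores, but I don't know whether that is
the recommended way in recent versions.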


Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread r...@open-mpi.org
Sounds like something didn’t quite get configured right, or maybe you have a 
library installed that isn’t quite set up correctly, or...

Regardless, we generally advise building from source to avoid such problems. Is 
there some reason not to just do so?
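For reference, a bare-bones build from source is just something like this
(adjust the version, prefix and any configure options you need):

  tar xf openmpi-1.10.3.tar.bz2
  cd openmpi-1.10.3
  ./configure --prefix=$HOME/openmpi-1.10.3
  make -j 8 all
  make install

and then put $HOME/openmpi-1.10.3/bin at the front of your PATH.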

> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini  
> wrote:
> 
> Hi,
> 
> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
> 
> openmpi-devel.x86_64 1.10.3-3.el7  
> 
> any code I try to run (including the mpitests-*) I get the following message 
> with slight variants:
> 
>  my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device failed 
> to appear after 15.0 seconds: Connection timed out
> 
> Is anyone able to help me in identifying the source of the problem?
> Anyway,  /dev/hfi1_0 doesn't exist.
> 
> If I use an OpenMPI version compiled from source I have no issue (gcc 4.8.5).
> 
> many thanks in advance.
> 
> cheers
> Daniele
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread r...@open-mpi.org
As best I can determine, mpirun catches SIGTERM just fine and will hit the 
procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait 
for the remote daemons to complete after they have hit their procs with the 
same sequence.


> On Dec 8, 2016, at 5:18 AM, Christof Koehler 
>  wrote:
> 
> Hello  again,
> 
> I am still not sure about breakpoints. But I did a "catch signal" in
> gdb, gdb's were attached to the two vasp processes and mpirun.
> 
> When the root rank exits I see in the gdb attaching to it
> [Thread 0x2b2787df8700 (LWP 2457) exited]
> [Thread 0x2b277f483180 (LWP 2455) exited]
> [Inferior 1 (process 2455) exited normally]
> 
> In the gdb attached to the mpirun
> Catchpoint 1 (signal SIGCHLD), 0x2b16560f769d in poll () from
> /lib64/libc.so.6
> 
> In the gdb attached to the second rank I see no output.
> 
> Issuing "continue" in the gdb session attached to mpi run does not lead
> to anything new as far as I can tell.
> 
> The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
> #0  0x2b16560f769d in poll () from /lib64/libc.so.6
> #1  0x2b1654b3a496 in poll_dispatch () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #2  0x2b1654b32fa5 in opal_libevent2022_event_base_loop () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x00406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at
> orterun.c:1071
> #4  0x004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at
> main.c:13
> 
> So there is a signal and mpirun does nothing with it ?
> 
> Cheers
> 
> Christof
> 
> 
> On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
>> Hello,
>> 
>> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
>>> Christof,
>>> 
>>> 
>>> There is something really odd with this stack trace.
>>> count is zero, and some pointers do not point to valid addresses (!)
>> Yes, I assumed it was interesting :-) Note that the program is compiled
>> with   -O2 -fp-model source, so optimization is on. I can try with -O0
>> or the gcc/gfortran ( will take a moment) to make sure it is not a
>> problem from that.
>> 
>>> 
>>> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
>>> the stack has been corrupted inside MPI_Allreduce(), or that you are not
>>> using the library you think you use
>>> pmap  will show you which lib is used
>> The pmap of the survivor is at the very end of this mail.
>> 
>>> 
>>> btw, this was not started with
>>> mpirun --mca coll ^tuned ...
>>> right ?
>> This is correct, not started with "mpirun --mca coll ^tuned". Using it
>> does not change something.
>> 
>>> 
>>> just to make it clear ...
>>> a task from your program bluntly issues a fortran STOP, and this is kind of
>>> a feature.
>> Yes. The library where the stack occurs is/was written for serial use as
>> far as I can tell. As I mentioned, it is not our code but this one
>> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which 
>> should
>> be a working combination.
>> 
>>> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
>>> never completes.
>>> did i get it right ?
>> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
>> would happen with a job in the queueing system.
>> 
>> Best Regards
>> 
>> Christof
>> 
>> Note: git branch 2.0.2 of openmpi was configured and installed (make
>> install) with
>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
>> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
>> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
>> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
>> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
>> 
>> The OS is Centos 7, relatively current :-) with current Omni-Path driver
>> package from Intel (10.2).
>> 
>> vasp is linked againts Intel MKL Lapack/Blas, self compiled scalapack
>> (trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
>> course the libwannier.a version 1.2 statically linked.
>> 
>> pmap -p of the survivor
>> 
>> 32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 0040  65200K r-x-- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 045ab000100K r 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 045c4000   2244K rw--- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 047f5000 100900K rw---   [ anon ]
>> 0bfaa000684K rw---   [ anon ]
>> 0c055000 20K rw---   [ anon ]
>> 0c05a000424K rw---   [ anon ]
>> 0c0c4000 68K rw---   [ anon ]
>> 0c0d5000  25384K rw---   [ anon ]
>> 2b17e34f6000132K r-x-- /usr/lib64/ld-2.17.so
>> 2b17e3517000  4K rw---   [ anon ]
>> 2b17e3518000 28K rw-s- /dev/infiniband/uverbs0
>> 2b17e3523000 88K rw---   [ anon ]
>> 2b17e3539000772K rw-s- /dev/infiniband

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Howard Pritchard
hello Daniele,

Could you post the output of the ompi_info command?  I'm noticing from the RPMS
that came with the rhel7.2 distro on one of our systems that it was built to
support psm2/hfi-1.

Two things: first, could you try running applications with

mpirun --mca pml ob1 (all the rest of your args)

and see if that works?

Second, what sort of system are you using?  Is this a cluster?  If it is,
you may want to check whether you have a situation where it's an Omni-Path
interconnect and you have the psm2/hfi1 packages installed, but for some
reason the Omni-Path HCAs themselves are not active.

On one of our Omni-Path systems the following hfi1-related rpms are installed:

hfidiags-0.8-13.x86_64

hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarc
hfi1-diagtools-sw-0.8-13.x86_64


Howard

2016-12-08 8:45 GMT-07:00 r...@open-mpi.org :

> Sounds like something didn’t quite get configured right, or maybe you have
> a library installed that isn’t quite setup correctly, or...
>
> Regardless, we generally advise building from source to avoid such
> problems. Is there some reason not to just do so?
>
> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini 
> wrote:
>
> Hi,
>
> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
>
> *openmpi-devel.x86_64 1.10.3-3.el7  *
>
> any code I try to run (including the mpitests-*) I get the following
> message with slight variants:
>
> * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
> failed to appear after 15.0 seconds: Connection timed out*
>
> Is anyone able to help me in identifying the source of the problem?
> Anyway, * /dev/hfi1_0* doesn't exist.
>
> If I use an OpenMPI version compiled from source I have no issue (gcc
> 4.8.5).
>
> many thanks in advance.
>
> cheers
> Daniele
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Cabral, Matias A
>Anyway,  /dev/hfi1_0 doesn't exist.
Make sure you have the hfi1 module/driver loaded.
In addition, please confirm the links are in the active state on all the nodes
with `opainfo`.
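For example, on each node something like

  lsmod | grep hfi1      (is the driver loaded?)
  ls -l /dev/hfi1_0      (does the device node exist?)
  opainfo                (the port state should show Active)

should tell you quickly where things stand.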

_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard 
Pritchard
Sent: Thursday, December 08, 2016 9:23 AM
To: Open MPI Users 
Subject: Re: [OMPI users] device failed to appear .. Connection timed out

hello Daniele,

Could you post the output from ompi_info command?  I'm noticing on the RPMS 
that came with the rhel7.2 distro on
one of our systems that it was built to support psm2/hfi-1.

Two things, could you try running applications with

mpirun --mca pml ob1 (all the rest of your args)

and see if that works?

Second,  what sort of system are you using?  Is this a cluster?  If it is, you 
may want to check whether
you have a situation where its an omnipath interconnect and you have the 
psm2/hfi1 packages installed
but for some reason the omnipath HCAs themselves are not active.

On one of our omnipath systems the following hfi1 related pms are installed:

hfidiags-0.8-13.x86_64

hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarc
hfi1-diagtools-sw-0.8-13.x86_64



Howard

2016-12-08 8:45 GMT-07:00 r...@open-mpi.org 
mailto:r...@open-mpi.org>>:
Sounds like something didn’t quite get configured right, or maybe you have a 
library installed that isn’t quite setup correctly, or...

Regardless, we generally advise building from source to avoid such problems. Is 
there some reason not to just do so?

On Dec 8, 2016, at 6:16 AM, Daniele Tartarini 
mailto:d.tartar...@sheffield.ac.uk>> wrote:

Hi,

I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:

openmpi-devel.x86_64 1.10.3-3.el7

any code I try to run (including the mpitests-*) I get the following message 
with slight variants:

 my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device failed to 
appear after 15.0 seconds: Connection timed out

Is anyone able to help me in identifying the source of the problem?
Anyway,  /dev/hfi1_0 doesn't exist.

If I use an OpenMPI version compiled from source I have no issue (gcc 4.8.5).

many thanks in advance.

cheers
Daniele

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Noam Bernstein
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet 
>  wrote:
> 
> Christof,
> 
> 
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
> 
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not 
> using the library you think you use
> pmap  will show you which lib is used
> 
> btw, this was not started with
> mpirun --mca coll ^tuned ...
> right ?
> 
> just to make it clear ...
> a task from your program bluntly issues a fortran STOP, and this is kind of a 
> feature.
> the *only* issue is mpirun does not kill the other MPI tasks and mpirun never 
> completes.
> did i get it right ?

I just ran across very similar behavior in VASP (which we just switched over to 
openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call one, 
others call the other), and I discovered several interesting things.

The most important thing is that when MPI is active, the preprocessor converts (via a 
#define in symbol.inc) fortran STOP into calls to m_exit() (defined in mpi.F), 
which is a wrapper around mpi_finalize.  So in my case some processes in the 
communicator call mpi_finalize, others call mpi_allreduce.  I’m not really 
surprised this hangs, because I think the correct thing to replace STOP with is 
mpi_abort, not mpi_finalize.  If you know where the STOP is called, you can 
check the preprocessed equivalent file (.f90 instead of .F), and see if it’s 
actually been replaced with a call to m_exit.  I’m planning to test whether 
replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e. 
program termination when the original source file executes a STOP.
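To make the pattern concrete, here is a stripped-down sketch (not VASP code,
just an illustration) of what I believe is happening: one rank takes the
STOP -> m_exit -> mpi_finalize path while the rest are already in the
collective, and the collective then never completes.

program stop_vs_allreduce
   use mpi
   implicit none
   integer :: ierr, rank, ival

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   if (rank == 0) then
      ! roughly what the preprocessed STOP does via m_exit():
      ! finalize and leave, while the other ranks head into the allreduce
      call MPI_Finalize(ierr)
      stop
   end if

   ival = rank
   ! the remaining ranks block here forever, since rank 0 never joins
   call MPI_Allreduce(MPI_IN_PLACE, ival, 1, MPI_INTEGER, MPI_SUM, &
                      MPI_COMM_WORLD, ierr)

   call MPI_Finalize(ierr)
end program stop_vs_allreduce

Replacing the rank-0 branch with a call to MPI_Abort(MPI_COMM_WORLD, 1, ierr)
is what I would expect a STOP to map to: the whole job gets torn down instead
of hanging.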

I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to 
hang, but just in case that’s surprising, here are my stack traces:


hung in collective:

(gdb) where
#0  0x2b8d5a095ec6 in opal_progress () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1  0x2b8d59b3a36d in ompi_request_default_wait_all () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2  0x2b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () 
from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x2b8d59b495ac in PMPI_Allreduce () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4  0x2b8d598e4027 in pmpi_allreduce__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5  0x00414077 in m_sum_i (comm=..., ivec=warning: Range for type 
(null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
..., n=2) at mpi.F:989
#6  0x00daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., 
kpoints_f=...) at mkpoints_full.F:1099
#7  0x01441654 in set_indpw_fock (t_info=..., p=warning: Range for type 
(null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1669
#8  fock::setup_fock (t_info=..., p=warning: Range for type (null) has invalid 
bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1413
#9  0x02976478 in vamp () at main.F:2093
#10 0x00412f9e in main ()
#11 0x00383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x00412ea9 in _start ()

hung in mpi_finalize:

#0  0x00383a4acbdd in nanosleep () from /lib64/libc.so.6
#1  0x00383a4e1d94 in usleep () from /lib64/libc.so.6
#2  0x2b11db1e0ae7 in ompi_mpi_finalize () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x2b11daf8b399 in pmpi_finalize__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#4  0x004199c5 in m_exit () at mpi.F:375
#5  0x00dab17f in full_kpoints::set_indpw_full (grid=..., w

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Daniele Tartarini
Hi Howard,

many thanks for your reply:

On 8 December 2016 at 17:22, Howard Pritchard  wrote:

> hello Daniele,
>
> Could you post the output from ompi_info command?  I'm noticing on the
> RPMS that came with the rhel7.2 distro on
> one of our systems that it was built to support psm2/hfi-1.
>
>
please find attached the output of ompi_info


> Two things, could you try running applications with
>
> mpirun --mca pml ob1 (all the rest of your args)
>
> and see if that works?
>

It works without complaining!


> Second,  what sort of system are you using?  Is this a cluster?  If it is,
> you may want to check whether
> you have a situation where its an omnipath interconnect and you have the
> psm2/hfi1 packages installed
> but for some reason the omnipath HCAs themselves are not active.
>
> On one of our omnipath systems the following hfi1 related pms are
> installed:
>
> *hfi*diags-0.8-13.x86_64
>
> *hfi*1-psm-devel-0.7-244.x86_64
> lib*hfi*1verbs-0.5-16.el7.x86_64
> *hfi*1-psm-0.7-244.x86_64
> *hfi*1-firmware-0.9-36.noarch
> *hfi*1-psm-compat-0.7-244.x86_64
> lib*hfi*1verbs-devel-0.5-16.el7.x86_64
> *hfi*1-0.11.3.10.0_327.el7.x86_64-245.x86_64
> *hfi*1-firmware_debug-0.9-36.noarc
> *hfi*1-diagtools-sw-0.8-13.x86_64
>
>
The machine is a dual-processor system with (GPUs and) an Intel Xeon Phi attached.
MPSS 3.7 is installed.
The Xeon Phi is a 3120A (Knights Corner), so it should be without Omni-Path.

I have no hfi package installed, but I do have

libpsm2.x86_64   10.2.33-1.el7


any idea?

cheers
daniele


> Howard
>
> 2016-12-08 8:45 GMT-07:00 r...@open-mpi.org :
>
>> Sounds like something didn’t quite get configured right, or maybe you
>> have a library installed that isn’t quite setup correctly, or...
>>
>> Regardless, we generally advise building from source to avoid such
>> problems. Is there some reason not to just do so?
>>
>> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <
>> d.tartar...@sheffield.ac.uk> wrote:
>>
>> Hi,
>>
>> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
>>
>> *openmpi-devel.x86_64 1.10.3-3.el7  *
>>
>> any code I try to run (including the mpitests-*) I get the following
>> message with slight variants:
>>
>> * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
>> failed to appear after 15.0 seconds: Connection timed out*
>>
>> Is anyone able to help me in identifying the source of the problem?
>> Anyway, * /dev/hfi1_0* doesn't exist.
>>
>> If I use an OpenMPI version compiled from source I have no issue (gcc
>> 4.8.5).
>>
>> many thanks in advance.
>>
>> cheers
>> Daniele
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 
--
Daniele Tartarini

Post-Doctoral Research Associate
Dept. Mechanical Engineering &
INSIGNEO, institute for *in silico* medicine,
University of Sheffield, Sheffield, UK
linkedIn 
$ ompi_info
 Package: Open MPI mockbuild@ Distribution
Open MPI: 1.10.3
  Open MPI repo revision: v1.10.2-251-g9acf492
   Open MPI release date: Jun 14, 2016
Open RTE: 1.10.3
  Open RTE repo revision: v1.10.2-251-g9acf492
   Open RTE release date: Jun 14, 2016
OPAL: 1.10.3
  OPAL repo revision: v1.10.2-251-g9acf492
   OPAL release date: Jun 14, 2016
 MPI API: 3.0.0
Ident string: 1.10.3
  Prefix: /usr/lib64/openmpi
 Configured architecture: x86_64-pc-linux-gnu
  Configure host:
   Configured by: mockbuild
   Configured on: Fri Aug  5 07:44:11 EDT 2016
  Configure host:
Built by: mockbuild
Built on: Fri Aug  5 07:48:35 EDT 2016
  Built host:
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
  C compiler version: 4.8.5
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
   Fort compiler: gfortran
   Fort compiler abs: /usr/bin/gfortran
 Fort ignore TKR: no
   Fort 08 assumed shape: no
  Fort optional args: no
  Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
   Fort S

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Daniele Tartarini
Hi,
many thanks for your reply.

I have an Intel S2600IP motherboard. It is a stand-alone server and I cannot
see any Omni-Path device, and hence no such modules.
opainfo is not available on my system.

Am I missing anything?
cheers
Daniele

On 8 December 2016 at 17:55, Cabral, Matias A 
wrote:

> >Anyway, * /dev/hfi1_0* doesn't exist.
>
> Make sure you have the hfi1 module/driver loaded.
>
> In addition, please confirm the links are in active state on all the nodes
> `opainfo`
>
>
>
> _MAC
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Thursday, December 08, 2016 9:23 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] device failed to appear .. Connection timed
> out
>
>
>
> hello Daniele,
>
>
>
> Could you post the output from ompi_info command?  I'm noticing on the
> RPMS that came with the rhel7.2 distro on
>
> one of our systems that it was built to support psm2/hfi-1.
>
>
>
> Two things, could you try running applications with
>
>
>
> mpirun --mca pml ob1 (all the rest of your args)
>
>
>
> and see if that works?
>
>
>
> Second,  what sort of system are you using?  Is this a cluster?  If it is,
> you may want to check whether
>
> you have a situation where its an omnipath interconnect and you have the
> psm2/hfi1 packages installed
>
> but for some reason the omnipath HCAs themselves are not active.
>
>
>
> On one of our omnipath systems the following hfi1 related pms are
> installed:
>
>
>
> *hfi*diags-0.8-13.x86_64
>
> *hfi*1-psm-devel-0.7-244.x86_64
> lib*hfi*1verbs-0.5-16.el7.x86_64
> *hfi*1-psm-0.7-244.x86_64
> *hfi*1-firmware-0.9-36.noarch
> *hfi*1-psm-compat-0.7-244.x86_64
> lib*hfi*1verbs-devel-0.5-16.el7.x86_64
> *hfi*1-0.11.3.10.0_327.el7.x86_64-245.x86_64
> *hfi*1-firmware_debug-0.9-36.noarc
> *hfi*1-diagtools-sw-0.8-13.x86_64
>
>
>
> Howard
>
>
>
> 2016-12-08 8:45 GMT-07:00 r...@open-mpi.org :
>
> Sounds like something didn’t quite get configured right, or maybe you have
> a library installed that isn’t quite setup correctly, or...
>
>
>
> Regardless, we generally advise building from source to avoid such
> problems. Is there some reason not to just do so?
>
>
>
> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini 
> wrote:
>
>
>
> Hi,
>
> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
>
> *openmpi-devel.x86_64 1.10.3-3.el7  *
>
>
>
> any code I try to run (including the mpitests-*) I get the following
> message with slight variants:
>
>
>
> * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
> failed to appear after 15.0 seconds: Connection timed out*
>
>
>
> Is anyone able to help me in identifying the source of the problem?
>
> Anyway, * /dev/hfi1_0* doesn't exist.
>
>
>
> If I use an OpenMPI version compiled from source I have no issue (gcc
> 4.8.5).
>
>
>
> many thanks in advance.
>
>
>
> cheers
>
> Daniele
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 
--
Daniele Tartarini

Post-Doctoral Research Associate
Dept. Mechanical Engineering &
INSIGNEO, institute for *in silico* medicine,
University of Sheffield, Sheffield, UK
linkedIn 

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Howard Pritchard
Hi Daniele,

I bet this psm2 got installed as part of MPSS 3.7.  I see something in the
readme for that about an MPSS install with OFED support.
I think if you want to go the route of using the RHEL Open MPI RPMS, you
could use the mca-params.conf file approach to disable the use of psm2.

This file and a lot of other stuff about mca parameters is described here:

https://www.open-mpi.org/faq/?category=tuning
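For example (just a sketch, mirroring the --mca pml ob1 you already tried on
the command line), putting the line

  pml = ob1

into $HOME/.openmpi/mca-params.conf should steer Open MPI away from the psm2
path without touching the installed packages.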

Alternatively, you could try and build/install Open MPI yourself from the
download page:

https://www.open-mpi.org/software/ompi/v1.10/

The simplest solution - but you need to be confident that nothing's using
the PSM2 software - would be to just use yum to deinstall the psm2 rpm.

Good luck,

Howard




2016-12-08 14:17 GMT-07:00 Daniele Tartarini :

> Hi,
> many thanks for tour reply.
>
> I have a S2600IP Intel motherboard. it is a stand alone server and I
> cannot see any omnipath device and so not such modules.
> opainfo is not available on my system
>
> missing anything?
> cheers
> Daniele
>
> On 8 December 2016 at 17:55, Cabral, Matias A 
> wrote:
>
>> >Anyway, * /dev/hfi1_0* doesn't exist.
>>
>> Make sure you have the hfi1 module/driver loaded.
>>
>> In addition, please confirm the links are in active state on all the
>> nodes `opainfo`
>>
>>
>>
>> _MAC
>>
>>
>>
>> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
>> Pritchard
>> *Sent:* Thursday, December 08, 2016 9:23 AM
>> *To:* Open MPI Users 
>> *Subject:* Re: [OMPI users] device failed to appear .. Connection timed
>> out
>>
>>
>>
>> hello Daniele,
>>
>>
>>
>> Could you post the output from ompi_info command?  I'm noticing on the
>> RPMS that came with the rhel7.2 distro on
>>
>> one of our systems that it was built to support psm2/hfi-1.
>>
>>
>>
>> Two things, could you try running applications with
>>
>>
>>
>> mpirun --mca pml ob1 (all the rest of your args)
>>
>>
>>
>> and see if that works?
>>
>>
>>
>> Second,  what sort of system are you using?  Is this a cluster?  If it
>> is, you may want to check whether
>>
>> you have a situation where its an omnipath interconnect and you have the
>> psm2/hfi1 packages installed
>>
>> but for some reason the omnipath HCAs themselves are not active.
>>
>>
>>
>> On one of our omnipath systems the following hfi1 related pms are
>> installed:
>>
>>
>>
>> *hfi*diags-0.8-13.x86_64
>>
>> *hfi*1-psm-devel-0.7-244.x86_64
>> lib*hfi*1verbs-0.5-16.el7.x86_64
>> *hfi*1-psm-0.7-244.x86_64
>> *hfi*1-firmware-0.9-36.noarch
>> *hfi*1-psm-compat-0.7-244.x86_64
>> lib*hfi*1verbs-devel-0.5-16.el7.x86_64
>> *hfi*1-0.11.3.10.0_327.el7.x86_64-245.x86_64
>> *hfi*1-firmware_debug-0.9-36.noarc
>> *hfi*1-diagtools-sw-0.8-13.x86_64
>>
>>
>>
>> Howard
>>
>>
>>
>> 2016-12-08 8:45 GMT-07:00 r...@open-mpi.org :
>>
>> Sounds like something didn’t quite get configured right, or maybe you
>> have a library installed that isn’t quite setup correctly, or...
>>
>>
>>
>> Regardless, we generally advise building from source to avoid such
>> problems. Is there some reason not to just do so?
>>
>>
>>
>> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <
>> d.tartar...@sheffield.ac.uk> wrote:
>>
>>
>>
>> Hi,
>>
>> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
>>
>> *openmpi-devel.x86_64 1.10.3-3.el7  *
>>
>>
>>
>> any code I try to run (including the mpitests-*) I get the following
>> message with slight variants:
>>
>>
>>
>> * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
>> failed to appear after 15.0 seconds: Connection timed out*
>>
>>
>>
>> Is anyone able to help me in identifying the source of the problem?
>>
>> Anyway, * /dev/hfi1_0* doesn't exist.
>>
>>
>>
>> If I use an OpenMPI version compiled from source I have no issue (gcc
>> 4.8.5).
>>
>>
>>
>> many thanks in advance.
>>
>>
>>
>> cheers
>>
>> Daniele
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
>
>
> --
> --
> Daniele Tartarini
>
> Post-Doctoral Research Associate
> Dept. Mechanical Engineering &
> INSIGNEO, institute for *in silico* medicine,
> University of Sheffield, Sheffield, UK
> linkedIn 
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>