Charles,
Having implemented some of the underlying collective algorithms, I am
puzzled by the need to force the sync interval down to 1 to keep things
flowing. I would definitely appreciate a reproducer so that I can identify
(and hopefully fix) the underlying problem.
Thanks,
George.
On Tue, Oct 29, 2019, Garrett, Charles via users wrote:
Last time I did a reply on here, it created a new thread. Sorry about that
everyone. I just hit the Reply via email button. Hopefully this one will work.
To Gilles Gouaillardet:
My first thread has a reproducer that causes the problem.
To George Bosilca:
I had to set coll_sync_barrier_before=1 to get things flowing.
Charles,
There is a known issue with calling collectives in a tight loop, due to the
lack of flow control at the network level. It results in a significant
slow-down that might appear as a deadlock to users. The workaround is to
enable the sync collective module, which inserts a fake barrier every N
collective operations (N is set with the coll_sync_barrier_before MCA
parameter).
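As a rough illustration (a sketch only, not the actual coll/sync source), the
effect of the module on a tight broadcast loop is equivalent to injecting the
barrier by hand:

#include <mpi.h>

/* Conceptual sketch: injecting a barrier every n_sync collectives keeps any
 * rank from running arbitrarily far ahead and flooding a slower rank with
 * unexpected messages. The real fix is the coll/sync module; this only
 * illustrates what it amounts to. */
static void bcast_loop_with_sync(double *buf, int count, long iters, long n_sync)
{
    for (long i = 0; i < iters; i++) {
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if ((i + 1) % n_sync == 0)
            MPI_Barrier(MPI_COMM_WORLD);   /* the "fake barrier" */
    }
}

In practice the interval is controlled with the coll_sync_barrier_before MCA
parameter rather than by changing the application.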
Charles,
unless you expect yes or no answers, can you please post a simple
program that demonstrates the issue you are facing?
Cheers,
Gilles
On 10/29/2019 6:37 AM, Garrett, Charles via users wrote:
> Does anyone have any idea why this is happening? Has anyone seen this
> problem before?
Does anyone have any idea why this is happening? Has anyone seen this problem
before?
I have a problem where MPI_Bcast hangs when called rapidly over and over again.
This problem manifests itself on our new cluster, but not on our older one.
The new cluster has Cascade Lake processors. Each node contains 2 sockets with
18 cores per socket. Cluster size is 128 nodes [...]
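For reference, a stripped-down loop of the kind described above looks roughly
like this (a hypothetical sketch, not the reproducer attached to the first
thread; buffer size and iteration count are made up):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4096] = {0};
    for (long i = 0; i < 1000000; i++) {
        /* Rank 0 broadcasts as fast as it can; there is no other
         * synchronization between iterations. */
        MPI_Bcast(buf, 4096, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (rank == 0 && i % 100000 == 0)
            printf("iteration %ld\n", i);
    }

    MPI_Finalize();
    return 0;
}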
> something else?
Yes, this is with regard to the collective hang issue.
All the best,
Alex
- Original Message -
From: "Jeff Squyres"
To: "Alex A. Granovsky" ;
Sent: Saturday, December 03, 2011 3:36 PM
Subject: Re: [OMPI users] Program hangs in mpi_bcast
On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote:
> I would like to start a discussion on the implementation of collective
> operations within Open MPI. The reason for this is at least twofold.
> In recent months there has been a constantly growing number of messages
> on the list from people facing problems [...]
[...] so close to the hardware limits does not make us happy at all.
Kind regards,
Alex Granovsky
- Original Message -
From: "Jeff Squyres"
To: "Open MPI Users"
Sent: Wednesday, November 30, 2011 11:45 PM
Subject: Re: [OMPI users] Program hangs in mpi_bcast
> Fair enough. Thanks anyway!
Fair enough. Thanks anyway!
On Nov 30, 2011, at 3:39 PM, Tom Rosmond wrote:
> Jeff,
>
> I'm afraid trying to produce a reproducer of this problem wouldn't be
> worth the effort. It is a legacy code that I wasn't involved in
> developing and will soon be discarded, so I can't justify spending time
> trying to understand its behavior better.
Jeff,
I'm afraid trying to produce a reproducer of this problem wouldn't be
worth the effort. It is a legacy code that I wasn't involved in
developing and will soon be discarded, so I can't justify spending time
trying to understand its behavior better. The bottom line is that it
works correctly with the workaround in place.
Yes, but I'd like to see a reproducer that requires setting the
sync_barrier_before=5. Your reproducers allowed much higher values, IIRC.
I'm curious to know what makes that code require such a low value (i.e., 5)...
On Nov 30, 2011, at 1:50 PM, Ralph Castain wrote:
> FWIW: we already have a reproducer from prior work I did chasing this down a
> couple of years ago. See orte/test/mpi/bcast_loop.c
Oh - and another one at orte/test/mpi/reduce-hang.c
On Nov 30, 2011, at 11:50 AM, Ralph Castain wrote:
> FWIW: we already have a reproducer from prior work I did chasing this down a
> couple of years ago. See orte/test/mpi/bcast_loop.c
>
>
> On Nov 29, 2011, at 9:35 AM, Jeff Squyres wrote:
>
FWIW: we already have a reproducer from prior work I did chasing this down a
couple of years ago. See orte/test/mpi/bcast_loop.c
On Nov 29, 2011, at 9:35 AM, Jeff Squyres wrote:
> That's quite weird/surprising that you would need to set it down to *5* --
> that's really low.
>
> Can you share a simple reproducer code, perchance?
That's quite weird/surprising that you would need to set it down to *5* --
that's really low.
Can you share a simple reproducer code, perchance?
On Nov 15, 2011, at 11:49 AM, Tom Rosmond wrote:
> Ralph,
>
> Thanks for the advice. I have to set 'coll_sync_barrier_before=5' to do
> the job. This is a big change from the default value (1000), so our
> application seems to be a pretty extreme case.
Ralph,
Thanks for the advice. I have to set 'coll_sync_barrier_before=5' to do
the job. This is a big change from the default value (1000), so our
application seems to be a pretty extreme case.
T. Rosmond
On Mon, 2011-11-14 at 16:17 -0700, Ralph Castain wrote:
> Yes, this is well documented - may be on the FAQ, but certainly has been in
> the user list multiple times.
Yes, this is well documented - may be on the FAQ, but certainly has been in the
user list multiple times.
The problem is that one process falls behind, which causes it to begin
accumulating "unexpected messages" in its queue. This causes the matching
logic to run a little slower, thus making that process fall even further
behind.
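That runaway effect is easy to exaggerate on purpose. The sketch below
(hypothetical, not code from this thread) slows one rank down artificially;
the eagerly-sent broadcast fragments from the other ranks then pile up in the
slow rank's unexpected-message queue exactly as described:

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int buf[256] = {0};
    for (long i = 0; i < 200000; i++) {
        if (rank == size - 1)
            usleep(50);   /* this rank is the laggard */
        /* Small broadcasts are sent eagerly, so the senders do not wait
         * for the laggard; its unexpected-message queue keeps growing and
         * its matching gets slower and slower. */
        MPI_Bcast(buf, 256, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}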
Hello:
A colleague and I have been running a large F90 application that does an
enormous number of mpi_bcast calls during execution. I deny any
responsibility for the design of the code and why it needs these calls,
but it is what we have inherited and have to work with.
Recently we ported the code [...]
I notice that in the worker, you have:
eth2 Link encap:Ethernet HWaddr 00:1b:21:77:c5:d4
inet addr:192.168.1.155 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
I am having trouble running my MPI program on multiple nodes. I can
run multiple processes on a single node, and I can spawn processes on
remote nodes, but when I call Send from a remote node, the call never
returns, even though there is an appropriate Recv waiting. I'm pretty
sure this is an issue [...]
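The two-rank test that shows this kind of symptom is tiny; a hypothetical
sketch is below. If a program like this hangs across nodes but works on one
node, the cause is usually environmental (firewall rules, or the TCP BTL
picking the wrong interface, which can be constrained with the
btl_tcp_if_include MCA parameter) rather than the MPI calls themselves:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}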
> [...] directly. There are
> still many performance issues to be worked out, but just thought I would
> mention it.
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Fengguang Song
> Sent: Sunday, June 05, 2011 9:54 AM
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Fengguang Song
Sent: Sunday, June 05, 2011 9:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Program hangs when using OpenMPI and CUDA
Hi Brice,
Thank you! I saw your previous discussion and actually have tried "--mca
btl_openib_flags 304".
It didn't solve the problem, unfortunately. In our case, the MPI buffer is
different from the cudaMemcpy buffer, and we manually copy between them.
I'm still trying to figure out how to [...]
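For what it's worth, the manual host-staging pattern described above looks
roughly like this (a hypothetical sketch, not Fengguang's actual code; it
assumes separate device and host buffers of the same size):

#include <mpi.h>
#include <cuda_runtime.h>

/* Stage device data through host buffers around the MPI call, which is
 * what an application has to do when the MPI library is not CUDA-aware. */
static void exchange_with_peer(double *d_send, double *d_recv,
                               double *h_send, double *h_recv,
                               int count, int peer)
{
    cudaMemcpy(h_send, d_send, count * sizeof(double),
               cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, count, MPI_DOUBLE, peer, 0,
                 h_recv, count, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, count * sizeof(double),
               cudaMemcpyHostToDevice);
}

The fact that only messages larger than 1MB are affected points at the
large-message path, which is presumably why changing btl_openib_flags (which
controls whether the RDMA protocols are used) was suggested.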
On 05/06/2011 00:15, Fengguang Song wrote:
> Hi,
>
> I'm confronting a problem when using OpenMPI 1.5.1 on a GPU cluster. My
> program uses MPI to exchange data
> between nodes, and uses cudaMemcpyAsync to exchange data between Host and GPU
> devices within a node.
> When the MPI message size is less than 1MB, everything works fine.
Hi,
I'm confronting a problem when using OpenMPI 1.5.1 on a GPU cluster. My program
uses MPI to exchange data
between nodes, and uses cudaMemcpyAsync to exchange data between Host and GPU
devices within a node.
When the MPI message size is less than 1MB, everything works fine. However,
when the message size is larger than that, the program hangs.
Dear Eugene,
I am sorry that I may not have explained the problem clearly last time. The
problem is that I tested Open MPI with the PWscf program on one quad-core
node. For the first several hours, the program ran quite well. When the
electronic SCF was about to converge, the program started to hang. For
example, [...]
I can't tell if these problems are related to trac ticket 2043 or not.
Compiler: in my experience, trac 2043 depends on GCC 4.4.x. It isn't
necessarily a GCC bug... perhaps it's just exposing an OMPI problem.
I'm not sure what compiler Jiaye is using, and Vasilis is apparently
seeing a problem [...]
Hi
I killed the job and re-submitted it. This time it was able to keep running,
but today I found an even more serious problem with Open MPI. I compared the
results of MPICH2 and Open MPI, and found that the results from Open MPI are
wrong: the run finished before the real end. In other words, the optimized
structure (by VASP) [...]
Hello,
I also experience a similar problem with the MUMPS solver when I run it on a
cluster. After several hours of running, the code does not produce any
results, although the top command shows that the program occupies 100% of the
CPU. The difference here, however, is that the same program runs [...]
Hello
I installed openmpi-1.3.3 on my single-node Intel 64-bit quad-core
machine. The compiler info is:
**
intel-icc101018-10.1.018-1.i386
libgcc-4.4.0-4.i586
gcc-4.4.0-4.i586
gcc-gfortran-4.4.0-4.i586
On Tue, 2009-10-06 at 12:22 +0530, souvik bhattacherjee wrote:
> This implies that one has to copy the executables to the remote host
> each time one wants to run a program that is different from the
> previous one.
This is correct: the name of the executable is passed to each node, and
that executable must therefore be present on every node.
Finally, it seems I'm able to run my program on a remote host.
The problem was due to some firewall settings. Modifying the firewall ACCEPT
policy as shown below did the trick.
# /etc/init.d/ip6tables stop
Resetting built-in chains to the default ACCEPT policy: [ OK ]
# /etc/init.d/iptables stop
As Ralph suggested, I *reversed the order of my PATH settings*:
This is what it shows:
$ echo $PATH
/usr/local/openmpi-1.3.3/bin/:/usr/bin:/bin:/usr/local/bin:/usr/X11R6/bin/:/usr/games:/usr/lib/qt4/bin:/usr/bin:/opt/kde3/bin
$ echo $LD_LIBRARY_PATH
/usr/local/openmpi-1.3.3/lib/
Moreover, I [...]
One thing that catches my attention: in your PATH definition, you put
$PATH ahead of your OMPI 1.3.3 installation. Thus, if there are any
system-supplied versions of OMPI hanging around (and there often are),
they will be executed instead of your new installation.
You might try reversing that order.
Hi Gus (and all OpenMPI users),
Thanks for your interest in my problem. However, it seems to me that I had
already taken care of the points you raised earlier in your mails. I have
listed them below, point by point. Your comments are rewritten in *RED* and
my replies are in *BLACK*.
1) As you have [...]
Hi Souvik
Also worth checking:
1) If you can ssh passwordless from ict1 to ict2 *and* vice versa.
2) If your /etc/hosts file on *both* machines lists ict1 and ict2
and their IP addresses.
3) In case you have a /home directory on each machine (i.e. /home is
not NFS mounted), if your .bashrc files on *both* machines set up the
Open MPI PATH and LD_LIBRARY_PATH.
Hi Souvik
I would guess you installed Open MPI only on ict1, not on ict2.
If that is the case you won't have the required Open MPI libraries
in ict2:/usr/local, and the job won't run on ict2.
I am guessing this because you used a prefix under /usr/local,
which tends to be a "per machine" directory.
Dear all,
I am quite new to Open MPI. Recently, I installed openmpi-1.3.3
separately on two of my machines, ict1 and ict2. These machines are
dual-socket quad-core (Intel Xeon E5410), i.e. each has 8 cores, and they
are connected by a Gigabit Ethernet switch. As a prerequisite, I can ssh
between the two machines without a password [...]