Re: [OMPI users] Eager sending on InfiniBand

2016-05-16 Thread Xiaolong Cui
Hi Nathan, Thanks for your answer. The "credits" make sense for the purpose of flow control. However, the sender in my case is blocked even for the first message, which does not look like a symptom of running out of credits. Is there any reason for this? Also, is there an mca parameter for t

Re: [OMPI users] Eager sending on InfiniBand

2016-05-16 Thread Nathan Hjelm
When using eager_rdma the sender will block once it runs out of "credits". If the receiver enters MPI for any reason the incoming messages will be placed in the ob1 unexpected queue and the credits will be returned to the sender. If you turn off eager_rdma you will probably get different results.
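A sketch of how the suggestion above might be tried on the command line. The parameter names below belong to the openib BTL; they are stated here as assumptions, so verify them against your own installation with `ompi_info --param btl openib` before relying on them.

```shell
# Turn off eager RDMA, as suggested above, to see if the sender
# still blocks (parameter name assumed; check with ompi_info):
mpirun --mca btl_openib_use_eager_rdma 0 -np 2 ./a.out

# Alternatively, shrink the eager threshold so more messages take
# the rendezvous path instead of the eager path:
mpirun --mca btl_openib_eager_limit 1024 -np 2 ./a.out
```

These are runtime MCA settings, so no rebuild is needed; the same values can also be set in `$HOME/.openmpi/mca-params.conf`.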

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Ralph Castain
I honestly have no idea… > On May 16, 2016, at 10:39 AM, Zabiziz Zaz wrote: > > Ok. > Could you please tell me the latest version that is supported? > > Regards, > Guilherme. > > On Mon, May 16, 2016 at 12:30 PM, Ralph Castain > wrote: > We used to do so, but don’t c

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
Ok. Could you please tell me the latest version that is supported? Regards, Guilherme. On Mon, May 16, 2016 at 12:30 PM, Ralph Castain wrote: > We used to do so, but don’t currently support that model - folks are > working on restoring it. No timetable, though I don’t think it will be too > muc

[OMPI users] Eager sending on InfiniBand

2016-05-16 Thread Xiaolong Cui
Hi, I am using Open MPI 1.8.6. I guess my question is related to the flow control algorithm for small messages. The question is how to avoid the sender being blocked by the receiver when using *openib* module for small messages and using *blocking send*. I have looked through this FAQ( https://www

Re: [OMPI users] Mpirun invocation only works in debug mode, hangs in "normal" mode.

2016-05-16 Thread Jeff Squyres (jsquyres)
I'm afraid I don't know what the difference is in systemd between ssh.socket and ssh.service, or why that would change Open MPI's behavior. One other thing to try is to mpirun non-MPI programs, like "hostname", and see if that works. This will help distinguish between problems with Open MPI's run
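The sanity check suggested above, spelled out (hostnames are placeholders). If this hangs too, the problem is in the runtime layer (ssh/orted), not in the MPI library itself:

```shell
# Launch a non-MPI program across the nodes; each rank just prints
# its host name, so no MPI communication is exercised at all.
mpirun -np 2 --host node1,node2 hostname
```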

Re: [OMPI users] Incorrect function call in simple C program

2016-05-16 Thread Thomas Jahns
On May 10, 2016, at 12:26, Gilles Gouaillardet wrote: except if you #include the libc header in your app, *and* your send function has a different prototype, I do not see how clang can issue a warning (except of course if clang "knows" all the libc subroutines ...) not sure if that helps t

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
My application has a heartbeat that checks whether a node is alive and can redistribute a task to another node if the master loses communication with it. The application also has checkpoint/restart, but since I usually have hundreds of nodes for one job and it usually takes a long time to restart the jo

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Ralph Castain
We used to do so, but don’t currently support that model - folks are working on restoring it. No timetable, though I don’t think it will be too much longer before it is in master. Can’t say when it will hit release > On May 16, 2016, at 8:25 AM, Zabiziz Zaz wrote: > > Hi Llolsten, > the proble

[OMPI users] ORTE has lost communication

2016-05-16 Thread Gilles Gouaillardet
What do you mean by a fault-tolerant application? From an Open MPI point of view, if such a connection is lost, your application will no longer be able to communicate, so killing it is the best option. If your application has built-in checkpoint/restart, then you have to restart it with mpirun after th

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
Hi Llolsten, the problem is not a firewall issue. The simplest way to reproduce the problem is rebooting a node in the middle of the job. Is it possible to configure Open MPI not to terminate the job if one node is rebooted in the middle of the job? Thanks again for your help. Regards, Guilhe

Re: [OMPI users] Question about mpirun mca_oob_tcp_recv_handler error.

2016-05-16 Thread Ralph Castain
We already do that as a check, but it came after the 1.6 series - and so you get the old error message if you mix with the 1.6 series or older versions. > On May 16, 2016, at 8:22 AM, Gilles Gouaillardet > wrote: > > or this could be caused by a firewall ... > v1.10 and earlier uses tcp for o

Re: [OMPI users] Question about mpirun mca_oob_tcp_recv_handler error.

2016-05-16 Thread Gilles Gouaillardet
Or this could be caused by a firewall ... v1.10 and earlier use TCP for the OOB; from v2.x, Unix sockets are used. Detecting a consistent version is a good idea; printing them (mpirun, orted and a.out) can be a first step. My idea is that mpirun invokes orted with '--ompi_version=x.y.z' and orted checks it is
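Until such a built-in check exists, the version comparison discussed above can be done by hand. This sketch assumes passwordless ssh, a file named `hostfile` with one host per line, and that `mpirun` is on each host's default non-interactive PATH:

```shell
# Print the Open MPI version each remote host would actually use;
# a mismatch on any line explains the cryptic oob error.
while read -r h; do
    printf '%s: ' "$h"
    ssh "$h" 'mpirun --version | head -n 1'
done < hostfile
```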

Re: [OMPI users] Question about mpirun mca_oob_tcp_recv_handler error.

2016-05-16 Thread Dave Love
Ralph Castain writes: > This usually indicates that the remote process is using a different OMPI > version. You might check to ensure that the paths on the remote nodes are > correct. That seems quite a common problem with non-obvious failure modes. Is it not possible to have a mechanism that c

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Llolsten Kaonga
Hello Guilherme, This may be off, but try running your mpirun command with the option “--tag-output”. If you see a “broken pipe”, then your issue may be firewall related. You could then check the thread “Re: [OMPI users] mpirun command won't run unless the firewalld daemon is disabled” for how
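The suggestion above, spelled out. With this flag each line of output is prefixed with the job and rank that produced it, which makes it easy to see which node a "broken pipe" comes from:

```shell
# Prefix every output line with its [job,rank] tag so per-node
# failures can be attributed (flag per the mpirun man page):
mpirun --tag-output -np 4 ./a.out
```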

Re: [OMPI users] No core dump in some cases

2016-05-16 Thread Dave Love
Gilles Gouaillardet writes: > Are you sure ulimit -c unlimited is *really* applied on all hosts? > > can you please run the simple program below and confirm that? Nothing specifically wrong with that, but it's worth installing procenv(1) as a general solution to checking the (generalized) envi
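A one-liner in the same spirit as the check quoted above: print the core-file size limit that a launched shell actually inherits. Run locally it checks the current host; run under mpirun it shows the effective value on every host, which is what matters for getting core dumps:

```shell
# Print the core-file limit as seen by a freshly launched shell;
# "0" here means no core file will ever be written on that host.
bash -c 'echo "core limit: $(ulimit -c)"'
```

To check the remote hosts, the same command can be launched as `mpirun --host node1,node2 bash -c 'echo "$(hostname): $(ulimit -c)"'`.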

Re: [OMPI users] Building vs packaging

2016-05-16 Thread Dave Love
"Rob Malpass" writes: > Almost in desperation, I cheated: Why is that cheating? Unless you specifically want a different version, it seems sensible to me, especially as you then have access to packaged versions of at least some MPI programs. Likewise with rpm-based systems, which I'm afraid I

Re: [OMPI users] Building vs packaging

2016-05-16 Thread Jeff Squyres (jsquyres)
+1 to everything so far. Also, look in your shell startup files (e.g., $HOME/.bashrc) to see if certain parts of it are not executed for non-interactive logins. A common mistake we see is a shell startup file like this: # ... do setup for all logins ... if (this is a non-interactive login
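A sketch of the pattern described above, as it might appear in `~/.bashrc`. In bash, `$-` contains `i` only in interactive shells, so anything placed inside the interactive branch is invisible to the non-interactive shells that mpirun starts over ssh; the path below is hypothetical:

```shell
# ... setup for all logins goes above this guard ...
if [[ $- != *i* ]]; then
    : # non-interactive login: the rest of the file is skipped here
else
    # Only interactive logins reach this branch -- a PATH set here
    # will be missing when mpirun launches processes over ssh.
    export PATH=/opt/openmpi/bin:$PATH
fi
```

The fix is to move anything Open MPI needs (PATH, LD_LIBRARY_PATH) above the interactive-only guard.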

[OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
Hi, I'm using openmpi-1.10.2 and sometimes I'm receiving the message below: -- ORTE has lost communication with its daemon located on node: hostname: This is usually due to either a failure of the TCP network connecti

Re: [OMPI users] Building vs packaging

2016-05-16 Thread David Shrader
Hey Rob, I don't know if this is what is going on, but in general, when a package is installed via a distro's package management system, it ends up in system locations such as /usr/bin and /usr/lib that are automatically searched when looking for executables and libraries. So, it isn't necess
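A hedged sketch of the point above: a distro package lands in `/usr/bin` and `/usr/lib`, which are searched automatically, but a hand-built Open MPI under a custom `--prefix` must be exported explicitly (and early enough in the startup files that non-interactive logins see it too). The prefix below is hypothetical:

```shell
# Assumption: Open MPI was configured with --prefix=/opt/openmpi-1.10.2
MPI_PREFIX=/opt/openmpi-1.10.2
export PATH="$MPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```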