Re: [OMPI users] OpenMPI PMI2 with SLURM 14.03 not working [SOLVED]

2014-04-11 Thread Anthony Alba
Answered in the slurm-devel list: it is a bug in SLURM 14.03. The fix is already in HEAD and will also be in 14.03.1: https://groups.google.com/forum/#!topic/slurm-devel/1ctPkEn7TFI - Anthony

Re: [OMPI users] Question on suspending/resuming MPI processes with SIGSTOP

2014-04-11 Thread Ralph Castain
I'm afraid our suspend/resume support only allows the signal to be applied to *all* procs, not selectively to some. For that matter, I'm unaware of any MPI-level API for hitting a proc with a signal - so I'm not sure how you would programmatically have rank0 suspend some other ranks. On Apr 11,
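One OS-level workaround, outside any MPI API, would be to signal an individual rank's process directly on the node where it runs; a hypothetical sketch, with the hostname and pid as placeholders you would find via ps or pgrep:

    ssh node03 kill -STOP 12345   # suspend just that rank's process
    ssh node03 kill -CONT 12345   # resume it later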

Re: [OMPI users] Question on suspending/resuming MPI processes with SIGSTOP

2014-04-11 Thread Frank Wein
Frank Wein wrote: [...] Or basically my question could also be rephrased as: is there a barrier mechanism I could use in OMPI that causes very little to no CPU usage (at the cost of higher latency)? Intel MPI, for example, seems to have the env var "I_MPI_WAIT_MODE=1" which uses some wait mechanism
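Open MPI's nearest equivalent is the mpi_yield_when_idle MCA parameter, which makes a blocked process yield the CPU in its progress loop instead of spinning hard; it reduces, but does not eliminate, CPU usage. A minimal sketch (application name illustrative):

    mpirun --mca mpi_yield_when_idle 1 -np 4 ./my_app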

[OMPI users] Question on suspending/resuming MPI processes with SIGSTOP

2014-04-11 Thread Frank Wein
Hi, I've got a question on suspending/resuming a process started with "mpirun". I've already found the FAQ entry on this (http://www.open-mpi.de/faq/?category=running#suspend-resume) but I've still got a question about it. Basically, for now let's assume I'm running all MPI processes on one host only
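Per the FAQ entry cited above, mpirun traps SIGTSTP and forwards SIGSTOP to all of the ranks it launched, and SIGCONT resumes them, so the whole job can be suspended from the shell. A sketch, assuming mpirun was started in the background:

    mpirun -np 4 ./my_app &
    kill -TSTP $!   # mpirun forwards SIGSTOP to every rank
    kill -CONT $!   # every rank resumes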

[OMPI users] OpenMPI PMI2 with SLURM 14.03 not working

2014-04-11 Thread Anthony Alba
Not sure if this is a SLURM or OMPI issue so please bear with the cross-posting... The OpenMPI FAQ mentions an issue with slurm 2.6.3/pmi2. https://www.open-mpi.org/faq/?category=slurm#slurm-2.6.3-issue I have built both 1.7.5/1.8 against slurm 14.03/pmi2. When I launch openmpi/examples/hello_c
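For reference, a PMI2 launch goes through srun rather than mpirun, assuming Open MPI was configured against SLURM's PMI library; something like (paths illustrative):

    ./configure --with-slurm --with-pmi=/usr   # point at the SLURM/PMI install
    srun --mpi=pmi2 -n 4 ./examples/hello_c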

Re: [OMPI users] Troubleshooting mpirun with tree spawn hang

2014-04-11 Thread Anthony Alba
Oops, I meant = false. Thanks for the tip; it turns out the fault lay in a specific node that required oob_tcp_if_include to be set. On Friday, 11 April 2014, Ralph Castain wrote: > I'm a little confused - the "no_tree_spawn=true" option means that we are > *not* using tree spawn, and so mpirun
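That MCA parameter restricts which network interfaces Open MPI's out-of-band (OOB) layer will use; pinning it to a known-good interface looks something like this (interface name illustrative):

    mpirun --mca oob_tcp_if_include eth0 -np 4 ./my_app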

Re: [OMPI users] mpirun problem when running on more than three hosts with OpenMPI 1.8

2014-04-11 Thread Ralph Castain
The problem is with the tree-spawn nature of the rsh/ssh launcher. For scalability, mpirun only launches a first "layer" of daemons. Each of those daemons then launches another layer in a tree-like fanout. The default pattern is such that you first notice it when you have four nodes in your allocation
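A practical consequence is that every node in the allocation, not just the one running mpirun, must be able to ssh to the others. One way to test whether tree spawn is the culprit is to disable it so mpirun launches every daemon directly (hosts and application illustrative):

    mpirun --mca plm_rsh_no_tree_spawn 1 --host n1,n2,n3,n4 -np 4 ./my_app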

[OMPI users] mpirun problem when running on more than three hosts with OpenMPI 1.8

2014-04-11 Thread Allan Wu
Hello everyone, I am running a simple helloworld program on several nodes using OpenMPI 1.8. Running on a single node or a small number of nodes succeeds, but when I tried to run the same binary on four different nodes, problems occurred. I am using an 'mpirun' command line like the following

Re: [OMPI users] can't run mpi-jobs on remote host

2014-04-11 Thread Ralph Castain
Please see: http://www.open-mpi.org/faq/?category=rsh#ssh-keys Short answer: you need to be able to ssh to the remote hosts without a password. On Apr 11, 2014, at 1:09 AM, Lubrano Francesco wrote: > Dear MPI users, > I have a problem with open-mpi (version 1.8). > I'm just beginning to understand
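Setting that up typically means generating a key pair and installing the public key on each remote host (hostnames illustrative):

    ssh-keygen -t rsa             # accept defaults, empty passphrase
    ssh-copy-id user@remotehost   # install the public key on the remote host
    ssh remotehost hostname       # should now succeed without a password prompt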

Re: [OMPI users] Troubleshooting mpirun with tree spawn hang

2014-04-11 Thread Ralph Castain
I'm a little confused - the "no_tree_spawn=true" option means that we are *not* using tree spawn, and so mpirun is directly launching each daemon onto its node. Thus, this requires that the host mpirun is on be able to ssh to every other host in the allocation. You can debug the rsh launcher by
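In the 1.8 series the usual way to watch the launcher at work is to raise its verbosity; a sketch (verbosity level and application illustrative):

    mpirun --mca plm_base_verbose 5 --debug-daemons -np 4 ./my_app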

Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4

2014-04-11 Thread Jeff Squyres (jsquyres)
Sorry for the delay in replying. Can you try upgrading to Open MPI 1.8, which was released last week? We refreshed the version of ROMIO that is included in OMPI 1.8 vs. 1.6. On Apr 8, 2014, at 6:49 PM, Daniel Milroy wrote: > Hello, > > Recently a couple of our users have experienced difficulties

Re: [OMPI users] OpenMPI 1.8.0 + PGI 13.6 = undeclared variable __LDBL_MANT_DIG__

2014-04-11 Thread Jeff Squyres (jsquyres)
On Apr 9, 2014, at 8:47 PM, Filippo Spiga wrote: > I haven't solved this yet but I managed to move the code to be compatible with > PGI 14.3. Open MPI 1.8 compiles perfectly with the latest PGI. > > In parallel I will push this issue to the PGI forum. FWIW: we've seen this kind of issue before

[OMPI users] can't run mpi-jobs on remote host

2014-04-11 Thread Lubrano Francesco
Dear MPI users, I have a problem with open-mpi (version 1.8). I'm just beginning to understand how MPI works and I can't find a solution to my problem on the FAQ page. I have two machines (a local host and a remote host) running Linux openSUSE (latest version) and open-mpi 1.8. I can run open-mpi jobs

Re: [OMPI users] Optimal mapping/binding when threads are used?

2014-04-11 Thread Saliya Ekanayake
Thank you Ralph for the details, and it's a good point you mentioned on mapping by node vs. socket. We have another program that uses a chain of send/receives, which will benefit from having consecutive ranks nearby. I have a question on bind-to-none being equal to bind-to-all. I understand the two co

[OMPI users] Troubleshooting mpirun with tree spawn hang

2014-04-11 Thread Anthony Alba
Is there a way to troubleshoot a plm_rsh_no_tree_spawn=true hang? I have a set of passwordless-ssh nodes; each node can ssh into any other, i.e., "for h1 in A B C D; do for h2 in A B C D; do ssh $h1 ssh $h2 hostname; done; done" works perfectly. Generally tree spawn works; however, there is one host

Re: [OMPI users] Optimal mapping/binding when threads are used?

2014-04-11 Thread Ralph Castain
Interesting data. A couple of quick points that might help: option B is equivalent to --map-by node --bind-to none. When you bind to every core on the node, we don't bind you at all, since "bind to all" is exactly equivalent to "bind to none". So it will definitely run slower as the threads run ac
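For comparison, the 1.8-series mapping/binding options make the tradeoff explicit; two illustrative launch lines:

    mpirun --map-by node --bind-to none -np 4 ./my_app      # spread ranks across nodes, let threads float
    mpirun --map-by socket --bind-to socket -np 4 ./my_app  # confine each rank's threads to one socket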

Re: [OMPI users] Performance issue of mpirun/mpi_init

2014-04-11 Thread Ralph Castain
I shaved about 30% off the time - the patch is waiting for 1.8.1, but you can try it now (see the ticket for the changeset): https://svn.open-mpi.org/trac/ompi/ticket/4510#comment:1 I've added you to the ticket so you can follow what I'm doing. Getting any further improvement will take a little