Answered on the slurm-devel list: it is a bug in SLURM 14.03.
The fix is already in HEAD and will also be in 14.03.1.
https://groups.google.com/forum/#!topic/slurm-devel/1ctPkEn7TFI
- Anthony
I'm afraid our suspend/resume support only allows the signal to be applied to
*all* procs, not selectively to some. For that matter, I'm unaware of any
MPI-level API for hitting a proc with a signal - so I'm not sure how you would
programmatically have rank0 suspend some other ranks.
On Apr 11,
Frank Wein wrote:
[...]
Or my question could be rephrased as: is there a barrier mechanism I could
use in OMPI that causes very little to no CPU usage (at the cost of higher
latency)? Intel MPI, for example, seems to have
the env var "I_MPI_WAIT_MODE=1" which uses some wait mechanis
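A minimal sketch of the general "poll, then sleep" idea behind a low-CPU wait, written outside any MPI library. The helper name `wait_politely` and its parameters are hypothetical. In an MPI program the `is_done` predicate would typically wrap a completion test on a nonblocking barrier request (for example `MPI_Ibarrier` followed by `MPI_Test`); that mapping is an assumption for illustration, not something stated in this thread.

```python
import time


def wait_politely(is_done, poll_interval=0.01, timeout=None):
    """Poll is_done() and sleep between polls instead of spinning.

    Returns True once is_done() reports completion, or False if the
    optional timeout (in seconds) expires first. A larger poll_interval
    lowers CPU usage at the cost of higher wake-up latency.
    """
    start = time.monotonic()
    while not is_done():
        if timeout is not None and time.monotonic() - start >= timeout:
            return False
        time.sleep(poll_interval)  # yield the CPU between polls
    return True
```

The trade-off is the one named in the question: the process sleeps most of the time, so it reacts up to one `poll_interval` later than a busy-polling wait would.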
Hi,
I've got a question on suspending/resuming a process started with
"mpirun", I've already found the FAQ entry on this
http://www.open-mpi.de/faq/?category=running#suspend-resume but I've
still got a question on this. Basically for now let's assume I'm running
all MPI processes on one host only
Not sure if this is a SLURM or OMPI issue so please bear with the
cross-posting...
The OpenMPI FAQ mentions an issue with slurm 2.6.3/pmi2.
https://www.open-mpi.org/faq/?category=slurm#slurm-2.6.3-issue
I have built both 1.7.5/1.8 against slurm 14.03/pmi2.
When I launch openmpi/examples/hello_c
Oops, I meant = false.
Thanks for the tip, it turns out the fault lay in a specific node that
required oob_tcp_if_include to be set.
On Friday, 11 April 2014, Ralph Castain wrote:
> I'm a little confused - the "no_tree_spawn=true" option means that we are
> *not* using tree spawn, and so mpirun
The problem is with the tree-spawn nature of the rsh/ssh launcher. For
scalability, mpirun only launches a first "layer" of daemons. Each of those
daemons then launches another layer in a tree-like fanout. The default pattern
is such that you first notice it when you have four nodes in your allo
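To make the layered launch concrete, here is a small sketch that lists who would start whom under a tree-style fanout versus a direct launch. The k-ary tree used here, the function names, and the host names are all illustrative assumptions; this is not Open MPI's actual default launch pattern.

```python
def tree_launch_plan(hosts, fanout=2):
    """Return (launcher, target) pairs for a simple k-ary tree.

    hosts[0] is the host running mpirun; every other host is started
    by its parent in the tree (an illustrative layout only).
    """
    plan = []
    for i in range(1, len(hosts)):
        parent = (i - 1) // fanout
        plan.append((hosts[parent], hosts[i]))
    return plan


def direct_launch_plan(hosts):
    """Return (launcher, target) pairs when hosts[0] starts every daemon itself."""
    return [(hosts[0], h) for h in hosts[1:]]


if __name__ == "__main__":
    hosts = ["head", "A", "B", "C", "D"]
    for launcher, target in tree_launch_plan(hosts):
        print(f"{launcher} -> {target}")
```

Any pair whose launcher is not the first host implies that node-to-node ssh must work, whereas the direct plan only needs ssh from the mpirun host to each node.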
Hello everyone,
I am running a simple helloworld program on several nodes using OpenMPI
1.8. Running commands on a single node or a small number of nodes is
successful, but when I tried to run the same binary on four different
nodes, problems occurred.
I am using 'mpirun' command line like the follo
Please see:
http://www.open-mpi.org/faq/?category=rsh#ssh-keys
Short answer: you need to be able to ssh to the remote hosts without a password.
On Apr 11, 2014, at 1:09 AM, Lubrano Francesco
wrote:
> Dear MPI users,
> I have a problem with open-mpi (version 1.8).
> I'm just beginning to undes
I'm a little confused - the "no_tree_spawn=true" option means that we are *not*
using tree spawn, and so mpirun is directly launching each daemon onto its
node. Thus, this requires that the host mpirun is on be able to ssh to every
other host in the allocation.
You can debug the rsh launcher by
Sorry for the delay in replying.
Can you try upgrading to Open MPI 1.8, which was released last week? We
refreshed the version of ROMIO that is included in OMPI 1.8 vs. 1.6.
On Apr 8, 2014, at 6:49 PM, Daniel Milroy wrote:
> Hello,
>
> Recently a couple of our users have experienced diffic
On Apr 9, 2014, at 8:47 PM, Filippo Spiga wrote:
> I haven't solved this yet, but I managed to move the code to be compatible with
> PGI 14.3. Open MPI 1.8 compiles perfectly with the latest PGI.
>
> In parallel I will push this issue to the PGI forum.
FWIW: We've seen this kind of issue before
Dear MPI users,
I have a problem with Open MPI (version 1.8).
I'm just beginning to understand how MPI works and I can't find a solution to my
problem on the FAQ page.
I have two machines (a local host and a remote host) with Linux openSUSE
(latest version) and Open MPI 1.8.
I can run open-mpi jobs
Thank you Ralph for the details and it's a good point you mentioned on
mapping by node vs socket. We have another program that uses a chain of
send receives, which will benefit from having consecutive ranks nearby.
I've a question on bind to none being equal to bind to all. I understand
the two co
Is there a way to troubleshoot a
plm_rsh_no_tree_spawn=true hang?
I have a set of passwordless-ssh nodes; each node can ssh into any other,
i.e.,
for h1 in A B C D; do for h2 in A B C D; do ssh $h1 ssh $h2 hostname; done; done
works perfectly.
Generally tree spawn works, however there is one hos
Interesting data. Couple of quick points that might help:
option B is equivalent to --map-by node --bind-to none. When you bind to every
core on the node, we don't bind you at all since "bind to all" is exactly
equivalent to "bind to none". So it will definitely run slower as the threads
run ac
I shaved about 30% off the time - the patch is waiting for 1.8.1, but you can
try it now (see the ticket for the changeset):
https://svn.open-mpi.org/trac/ompi/ticket/4510#comment:1
I've added you to the ticket so you can follow what I'm doing. Getting any
further improvement will take a little