Hi Gilles,
I would like to be able to run on anywhere from 1 to 16 nodes.
Let me explain our (MPI/parallelism) situation briefly for more context:
We have a "master" job that needs MPI functionality. This master job is
written in Python (we use mpi4py). The master job then makes spawn calls
out to worker jobs.
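To make that concrete, the master does something roughly like this with
mpi4py (the worker script, data, and process count below are just
placeholders):

    from mpi4py import MPI

    # spawn a small worker job; Spawn() returns an intercommunicator
    # whose remote group is the newly started children
    workers = MPI.COMM_SELF.Spawn("python", args=["worker.py"], maxprocs=4)

    # hand the workers their input and collect one result back on the master
    workers.bcast({"job": "example"}, root=MPI.ROOT)
    result = workers.recv(source=0)

    # tear the spawn group down before the next round of spawns
    workers.Disconnect()

On the worker side the children pick the intercommunicator back up with
MPI.Comm.Get_parent() and Disconnect() from it when they finish.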
Jason,
How many nodes are you running on?
Since you have an IB network, IB is used for intra-node communication
between tasks that are not part of the same Open MPI job (read: spawn group).
I can make a simple patch to use TCP instead of IB for this intra-node
communication.
Let me know if you are interested.
Thanks, Ralph, for all the help. I will do that until it gets fixed.
Nathan, I am very, very interested in this working because we are developing
some cool new code for research in materials science. This is the last
piece of the puzzle for us, I believe. I can use TCP for now though, of
course.
You don’t want to always use those options, as your performance will take a
hit: trading InfiniBand for TCP isn’t a good deal. Sadly, this is something we
need someone like Nathan to address, as it is a bug in the code base, and in an
area I’m not familiar with.
For now, just use TCP so you can move forward.
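If it helps, the whole workaround is just that BTL selection on the command
line; roughly something like this (process count and script name are whatever
you actually launch):

    mpiexec -mca btl tcp,sm,self -np 16 python master.py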
Ralph, The problem *does* go away if I add "-mca btl tcp,sm,self" to the
mpiexec cmd line. (By the way, I am using mpiexec rather than mpirun; do
you recommend one over the other?) Will you tell me what this means for me?
For example, should I always append these arguments to mpiexec for my
non-test runs?
That message is coming from udcm in the openib btl. It indicates some sort of
failure in the connection mechanism. It can happen if the listening thread no
longer exists or is taking too long to process messages.
-Nathan
> On Jun 14, 2016, at 12:20 PM, Ralph Castain wrote:
Hmm…I’m unable to replicate a problem on my machines. What fabric are you
using? Does the problem go away if you add “-mca btl tcp,sm,self” to the mpirun
cmd line?
> On Jun 14, 2016, at 11:15 AM, Jason Maldonis wrote:
Hi Ralph et al.,
Great, thank you for the help. I downloaded the MPI loop_spawn test
directly from what I think is the master repo on GitHub:
https://github.com/open-mpi/ompi/blob/master/orte/test/mpi/loop_spawn.c
I am still using the MPI code from 1.10.2, however.
Is that test updated with the changes you mentioned?
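For reference, the pattern we exercise from Python mirrors that test: spawn,
exchange a little data, disconnect, repeat. Roughly (the worker script and
counts here are placeholders):

    from mpi4py import MPI

    for i in range(200):
        # each iteration creates a fresh spawn group...
        inter = MPI.COMM_SELF.Spawn("python", args=["worker.py"], maxprocs=2)
        # ...hands it a token...
        inter.bcast(i, root=MPI.ROOT)
        # ...and disconnects so the children can finalize and exit
        inter.Disconnect()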
I dug into this a bit (with some help from others) and found that the spawn
code appears to be working correctly; it is the test in orte/test that is
wrong. The test has been correctly updated in the 2.x and master repos, but we
failed to backport it to the 1.10 series. I have done so this morning.
No, that PR has nothing to do with loop_spawn. I’ll try to take a look at the
problem.
> On Jun 13, 2016, at 3:47 PM, Jason Maldonis wrote:
>
> Hello,
>
> I am using Open MPI 1.10.2 compiled with Intel. I am trying to get the spawn
> functionality to work inside a for loop, but continue to get errors.