Indeed odd - I'm afraid that this is just the kind of case that has been
causing problems. I think I've figured out the problem, but have been buried
with my "day job" for the last few weeks and unable to pursue it.
On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault wrote:
Ok, I confirm that with
mpiexec -mca oob_tcp_if_include lo ring_c
it works.
It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c
We have 4 interfaces on this node:
- lo, the local loopback
- ib0, InfiniBand
- eth2, a management network
- eth3, the public network
It seems that mpiexec atte
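For reference, the same interface restriction can also be made persistent rather than passed on every command line; a minimal sketch, assuming a bash shell:
# per-shell, via an environment variable
export OMPI_MCA_oob_tcp_if_include=lo
# or per-user, via the MCA parameter file (create the directory if it does not exist)
mkdir -p ~/.openmpi
echo "oob_tcp_if_include = lo" >> ~/.openmpi/mca-params.conf
# then launch without extra -mca flags
mpiexec -np 4 ring_c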
Yeah, there are some issues with the internal connection logic that need to get
fixed. We haven't had many cases where it's been an issue, but a couple like
this have cropped up - enough that I need to set aside some time to fix it.
My apologies for the problem.
On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault wrote:
Indeed, that makes sense now.
Why isn't OpenMPI attempting to connect over the local loopback for the same
node? This used to work with 1.6.5.
Maxime
On 2014-08-18 13:11, Ralph Castain wrote:
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect:
connection failed: Co
Here it is.
Maxime
On 2014-08-18 12:59, Ralph Castain wrote:
Ah...now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it
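A sketch of the combined command, reusing the tee pattern from elsewhere in the thread (the output file name is arbitrary):
mpirun -np 4 --mca plm_base_verbose 10 -mca oob_base_verbose 10 ring_c |& tee output_oob_verbose.txt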
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote:
This is all on one node indeed.
Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca
state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee
output_ringc_verbose.txt
Maxime
On 2014-08-18 12:48, Ralph Castain wrote:
This is all on one node, yes?
Try adding the following:
-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
Lot of garbage, but should tell us what is going on.
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote:
Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:
mpirun -np 4 --mca plm_base_verbose 10
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm
components
[helios-login1:27853] mca: base: compone
Maxime,
Can you run with:
mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c
On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
Hi,
I just did compile without Cuda, and the result is the same. No output,
exits with code 65.
[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 => (0x7fff3ab31000)
libmpi.so.1 =>
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1
(0x7fab9ec
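A quick way to double-check that both the launcher and the binary resolve to the intended (non-CUDA) build; a sketch, assuming a bash shell:
# which mpirun/mpiexec is first in PATH, and which Open MPI it belongs to
which mpirun
mpirun --version
# which libmpi the example is actually linked against
ldd ring_c | grep libmpi
# what the runtime linker will search
echo $LD_LIBRARY_PATH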
There is indeed also a problem with MPI + Cuda.
This problem, however, is deeper, since it happens with Mvapich2 1.9,
OpenMPI 1.6.5/1.8.1/1.8.2rc4, and Cuda 5.5.22/6.0.37. From my tests,
everything works fine with MPI + Cuda on a single node, but as soon as I
go to MPI + Cuda across nodes, I get s
Just out of curiosity, I saw that one of the segv stack traces involved the
cuda stack.
Can you try a build without CUDA and see if that resolves the problem?
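A sketch of what such a rebuild could look like, reusing the nocuda install prefix that appears elsewhere in this thread and keeping whatever other configure options the original build used (CUDA support in 1.8.x is only enabled when --with-cuda is given, so omitting it should be enough):
# configure without CUDA support, same compilers as before
./configure --prefix=/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda CC=gcc CXX=g++ FC=gfortran
make -j8 all
make install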
On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault wrote:
Hi Jeff,
On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote:
> Correct.
>
> Can it be because torque (pbs_mom) is not running on the head node and
> mpiexec attempts to contact it ?
Not for Open MPI's mpiexec, no.
Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM
stu
Correct.
Can it be because torque (pbs_mom) is not running on the head node and
mpiexec attempts to contact it?
Maxime
On 2014-08-15 17:31, Joshua Ladd wrote:
But OMPI 1.8.x does run the ring_c program successfully on your compute
node, right? The error only happens on the front-end login node if I
understood you correctly.
Josh
On Fri, Aug 15, 2014 at 5:20 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
Here are the requested files.
In the archive, you will find the output of configure, make, and make
install, as well as config.log, the environment when running ring_c,
and the output of ompi_info --all.
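For anyone reproducing this, a rough sketch of how such an archive can be assembled (file names and the install prefix here are just placeholders; the configure options are whatever the original build used):
./configure --prefix=$HOME/openmpi-test 2>&1 | tee config.out
make all 2>&1 | tee make.out
make install 2>&1 | tee install.out
ompi_info --all > ompi_info.out
env > environment.out
tar czf ompi-debug.tar.gz config.log config.out make.out install.out ompi_info.out environment.out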
Just for a reminder, the ring_c example compiled and ran, but produced
no output when running and ex
Hi,
I solved the warning that appeared with OpenMPI 1.6.5 on the login node
by increasing the registrable memory.
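A sketch of how the registrable-memory situation can be inspected, assuming a Mellanox mlx4 HCA (the usual limits are the locked-memory ulimit and the mlx4 MTT module parameters):
# locked (registered) memory limit for the current shell; ideally "unlimited"
ulimit -l
# mlx4 module parameters that bound how much memory can be registered
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg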
Now, with OpenMPI 1.6.5, it does not give any warning. Yet, with OpenMPI
1.8.1 and OpenMPI 1.8.2rc4, it still exits with error code 65 and does
not produce the normal output.
I w
Hi Josh,
The ring_c example does not work on our login node:
[mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c
[mboisson@helios-login1 examples]$ echo $?
65
[mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:
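A small sketch for capturing both output streams and the exit status in one go, in case anything is being printed to stderr:
# run the example, keep stdout+stderr, and record the exit status
mpiexec -np 10 ring_c > ring_c.out 2>&1
echo "exit status: $?"
cat ring_c.out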
One more, Maxime, can you please make sure you've covered everything here:
http://www.open-mpi.org/community/help/
Josh
On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd wrote:
And maybe include your LD_LIBRARY_PATH
Josh
On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd wrote:
Can you try to run the example code "ring_c" across nodes?
Josh
On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
Yes,
Everything has been built with GCC 4.8.x, although x might have changed
between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI
1.8.2rc4 however, it was the exact same compiler for everything.
Maxime
On 2014-08-14 14:57, Joshua Ladd wrote:
Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build
OMPI with the same compiler as you did GROMACS/Charm++?
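A sketch of how the two compilers can be compared, assuming both builds are still on disk:
# compiler recorded at Open MPI configure time
ompi_info | grep -i compiler
# compiler actually invoked by the Open MPI wrapper
mpicc --version
mpicc --showme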
I'm stealing this suggestion from an old Gromacs forum with essentially the
same symptom:
"Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc
I just tried Gromacs with two nodes. It crashes, but with a different
error. I get
[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-1
What about between nodes? Since this is coming from the OpenIB BTL, it
would be good to check this.
Do you know what the MPI thread level is set to when used with the Charm++
runtime? Is it MPI_THREAD_MULTIPLE? The OpenIB BTL is not thread safe.
Josh
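One way to check what the installation itself supports, independent of what the Charm++ runtime requests (the exact wording of the line varies between Open MPI versions); a sketch:
# shows whether the build was configured with MPI_THREAD_MULTIPLE support
ompi_info | grep -i thread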
On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boissonneault wrote:
Hi,
I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a
single node, with 8 ranks and multiple OpenMP threads.
Maxime
On 2014-08-14 14:15, Joshua Ladd wrote:
Hi, Maxime
Just curious, are you able to run a vanilla MPI program? Can you try one
of the example programs in the "examples" subdirectory? Looks like a
threading issue to me.
Thanks,
Josh
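A sketch of doing that from the Open MPI source tree (the path is a placeholder for wherever the tarball was unpacked; the examples directory ships with its own Makefile):
cd /path/to/openmpi-1.8.2rc4/examples
make
mpirun -np 4 ./ring_c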
Hi,
I just did with 1.8.2rc4 and it does the same:
[mboisson@helios-login1 simplearrayhello]$ ./hello
[helios-login1:11739] *** Process received signal ***
[helios-login1:11739] Signal: Segmentation fault (11)
[helios-login1:11739] Signal code: Address not mapped (1)
[helios-login1:11739] Failin
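To get more than the signal handler's one-line summary out of a crash like this, a sketch assuming gdb is available and the binary was built with debug symbols:
gdb ./hello
(gdb) run
(gdb) bt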
Can you try the latest 1.8.2 rc tarball? (just released yesterday)
http://www.open-mpi.org/software/ompi/v1.8/
On Aug 14, 2014, at 8:39 AM, Maxime Boissonneault wrote:
Note that if I do the same build with OpenMPI 1.6.5, it works flawlessly.
Maxime
On 2014-08-14 08:39, Maxime Boissonneault wrote:
Hi,
I compiled Charm++ 6.6.0rc3 using
./build charm++ mpi-linux-x86_64 smp --with-production
When compiling the simple example
mpi-linux-x86_64-smp/tests/charm+
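For completeness, a sketch of the sequence being described, with the test directory name taken from the prompt quoted earlier in the thread (treat the exact path as approximate):
./build charm++ mpi-linux-x86_64 smp --with-production
cd mpi-linux-x86_64-smp/tests/charm++/simplearrayhello
make
mpirun -np 4 ./hello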