Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Indeed odd - I'm afraid that this is just the kind of case that has been causing problems. I think I've figured out the problem, but have been buried with my "day job" for the last few weeks and unable to pursue it. On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault wrote: > Ok, I confirm th

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Ok, I confirm that with mpiexec -mca oob_tcp_if_include lo ring_c it works. It also works with mpiexec -mca oob_tcp_if_include ib0 ring_c. We have 4 interfaces on this node: lo, the local loopback; ib0, InfiniBand; eth2, a management network; eth3, the public network. It seems that mpiexec atte
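
For reference, the same interface restriction can be applied without editing every command line; a minimal sketch, where the interface list (lo,ib0) simply mirrors the interfaces named above:

  # per-shell, equivalent to passing --mca on the command line
  export OMPI_MCA_oob_tcp_if_include=lo,ib0
  mpiexec -np 4 ring_c

  # or persistently for this user
  echo "oob_tcp_if_include = lo,ib0" >> $HOME/.openmpi/mca-params.conf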

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Yeah, there are some issues with the internal connection logic that need to get fixed. We haven't had many cases where it's been an issue, but a couple like this have cropped up - enough that I need to set aside some time to fix it. My apologies for the problem. On Aug 18, 2014, at 10:31 AM, M

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Indeed, that makes sense now. Why isn't OpenMPI attempting to connect over the local loopback for the same node? This used to work with 1.6.5. Maxime On 2014-08-18 13:11, Ralph Castain wrote: Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [heli

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Co

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Here it is. Maxime On 2014-08-18 12:59, Ralph Castain wrote: Ah... now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it. On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all on one node indeed. Attached is th

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Ah... now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it. On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: > This is all on one node indeed. > > Attached is the output of > mpirun -np 4 --mca plm_base_verbose 10 -mca odls_

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
This is all on one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime On 2014-08-18 12:48, Ralph Castain wrote: This is all on one nod

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lots of garbage, but it should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: > Here it is > On 2014-08-18 12:30, Joshua Ladd

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Here it is. On 2014-08-18 12:30, Joshua Ladd wrote: mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: compone

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Joshua Ladd
Maxime, Can you run with: mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote: > Hi, > I just compiled without CUDA, and the result is the same. No output, > exits with code

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Hi, I just compiled without CUDA, and the result is the same. No output; it exits with code 65. [mboisson@helios-login1 examples]$ ldd ring_c linux-vdso.so.1 => (0x7fff3ab31000) libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x7fab9ec
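
A quick sanity check for this kind of rebuild (a sketch; it assumes the no-CUDA build was installed under the 1.8.2rc4_gcc4.8_nocuda prefix shown in the ldd output):

  which mpiexec                 # should point into the no-CUDA prefix
  mpiexec --version             # should report 1.8.2rc4
  ldd ring_c | grep libmpi      # should resolve to the same prefix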

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-16 Thread Maxime Boissonneault
There is indeed also a problem with MPI + CUDA. This problem, however, is deeper, since it happens with MVAPICH2 1.9, OpenMPI 1.6.5/1.8.1/1.8.2rc4, and CUDA 5.5.22/6.0.37. From my tests, everything works fine with MPI + CUDA on a single node, but as soon as I go to MPI + CUDA across nodes, I get s
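
If it helps to rule out the CUDA path, Open MPI can report whether a given build was compiled with CUDA support; a sketch, using the ompi_info query documented for CUDA-aware builds in the 1.7/1.8 series:

  ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
  # ...:value:true for a CUDA-aware build, :value:false otherwise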

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-16 Thread Jeff Squyres (jsquyres)
Just out of curiosity, I saw that one of the segv stack traces involved the CUDA stack. Can you try a build without CUDA and see if that resolves the problem? On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault wrote: > Hi Jeff, > > On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote: >> O

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Maxime Boissonneault
Hi Jeff, On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote: On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote: Correct. Could it be because Torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it? Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mp

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Jeff Squyres (jsquyres)
On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote: > Correct. > > Could it be because Torque (pbs_mom) is not running on the head node and > mpiexec attempts to contact it? Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stu

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Maxime Boissonneault
Correct. Could it be because Torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it? Maxime On 2014-08-15 17:31, Joshua Ladd wrote: But OMPI 1.8.x does run the ring_c program successfully on your compute node, right? The error only happens on the front-end logi

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Joshua Ladd
But OMPI 1.8.x does run the ring_c program successfully on your compute node, right? The error only happens on the front-end login node, if I understood you correctly. Josh On Fri, Aug 15, 2014 at 5:20 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote: > Here are the reques

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Maxime Boissonneault
Here are the requested files. In the archive, you will find the output of configure, make, and make install, as well as the config.log, the environment when running ring_c, and the ompi_info --all output. Just as a reminder, the ring_c example compiled and ran, but produced no output when running and ex

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Maxime Boissonneault
Hi, I solved the warning that appeared with OpenMPI 1.6.5 on the login node by increasing the registerable memory. Now, with OpenMPI 1.6.5, it does not give any warning. Yet, with OpenMPI 1.8.1 and OpenMPI 1.8.2rc4, it still exits with error code 65 and does not produce the normal output. I w
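
For completeness, the usual way to see how much memory the OpenFabrics stack can register; a sketch that assumes a Mellanox mlx4 HCA, which this thread does not actually confirm:

  ulimit -l                                              # locked-memory limit for the shell
  cat /sys/module/mlx4_core/parameters/log_num_mtt       # registerable memory is roughly
  cat /sys/module/mlx4_core/parameters/log_mtts_per_seg  # page_size * 2^log_num_mtt * 2^log_mtts_per_seg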

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-15 Thread Maxime Boissonneault
Hi Josh, The ring_c example does not work on our login node: [mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c [mboisson@helios-login1 examples]$ echo $? 65 [mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
One more, Maxime, can you please make sure you've covered everything here: http://www.open-mpi.org/community/help/ Josh On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd wrote: > And maybe include your LD_LIBRARY_PATH > > Josh > > > On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd wrote: > >> Can you

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
And maybe include your LD_LIBRARY_PATH Josh On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd wrote: > Can you try to run the example code "ring_c" across nodes? > > Josh > > > On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault < > maxime.boissonnea...@calculquebec.ca> wrote: > >> Yes, >> Every

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
Can you try to run the example code "ring_c" across nodes? Josh On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault < maxime.boissonnea...@calculquebec.ca> wrote: > Yes, > Everything has been built with GCC 4.8.x, although x might have changed > between the OpenMPI 1.8.1 build and the gromac
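
Something along these lines should exercise ring_c across two nodes (a sketch; the hostnames are placeholders and the slots= counts should match the actual cores per node):

  echo "node01 slots=4"  > hosts
  echo "node02 slots=4" >> hosts
  mpirun -np 8 --hostfile hosts ./ring_c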

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Maxime Boissonneault
Yes, everything has been built with GCC 4.8.x, although x might have changed between the OpenMPI 1.8.1 build and the GROMACS build. For OpenMPI 1.8.2rc4, however, it was the exact same compiler for everything. Maxime On 2014-08-14 14:57, Joshua Ladd wrote: Hmmm... weird. Seems like maybe a m

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build OMPI with the same compiler as you did GROMACS/Charm++? I'm stealing this suggestion from an old Gromacs forum with essentially the same symptom: "Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc
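
An easy way to check this on the installed side (a sketch; ompi_info reports the compilers that were used to build Open MPI):

  ompi_info | grep -i compiler     # the C/C++/Fortran compilers OMPI was built with
  mpicc --version                  # the compiler the wrappers invoke now
  gcc --version                    # the compiler used for GROMACS/Charm++, if built with gcc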

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Maxime Boissonneault
I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-1

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
What about between nodes? Since this is coming from the OpenIB BTL, it would be good to check this. Do you know what the MPI thread level is set to when used with the Charm++ runtime? Is it MPI_THREAD_MULTIPLE? The OpenIB BTL is not thread-safe. Josh On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boisson

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Maxime Boissonneault
Hi, I ran GROMACS successfully with OpenMPI 1.8.1 and CUDA 6.0.37 on a single node, with 8 ranks and multiple OpenMP threads. Maxime On 2014-08-14 14:15, Joshua Ladd wrote: Hi, Maxime. Just curious, are you able to run a vanilla MPI program? Can you try one of the example programs in

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Joshua Ladd
Hi, Maxime. Just curious, are you able to run a vanilla MPI program? Can you try one of the example programs in the "examples" subdirectory? It looks like a threading issue to me. Thanks, Josh
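
For anyone following along, the bundled examples can be built and run directly from the Open MPI source tree (a sketch; the source path is a placeholder):

  cd /path/to/openmpi-source/examples
  mpicc ring_c.c -o ring_c        # or simply run "make" in that directory
  mpirun -np 4 ./ring_c           # should print a few lines as the message passes around the ring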

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Maxime Boissonneault
Hi, I just tried with 1.8.2rc4 and it does the same: [mboisson@helios-login1 simplearrayhello]$ ./hello [helios-login1:11739] *** Process received signal *** [helios-login1:11739] Signal: Segmentation fault (11) [helios-login1:11739] Signal code: Address not mapped (1) [helios-login1:11739] Failin

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Jeff Squyres (jsquyres)
Can you try the latest 1.8.2 rc tarball? (just released yesterday) http://www.open-mpi.org/software/ompi/v1.8/ On Aug 14, 2014, at 8:39 AM, Maxime Boissonneault wrote: > Hi, > I compiled Charm++ 6.6.0rc3 using > ./build charm++ mpi-linux-x86_64 smp --with-production > > When compiling
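
If it helps, a typical way to build the rc tarball while capturing the logs the help page asks for (a sketch; the install prefix is just an example):

  # after downloading openmpi-1.8.2rc4.tar.bz2 from the page above
  tar xjf openmpi-1.8.2rc4.tar.bz2 && cd openmpi-1.8.2rc4
  ./configure --prefix=$HOME/openmpi/1.8.2rc4 2>&1 | tee config.out
  make -j8 2>&1 | tee make.out
  make install 2>&1 | tee install.out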

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-14 Thread Maxime Boissonneault
Note that if I do the same build with OpenMPI 1.6.5, it works flawlessly. Maxime On 2014-08-14 08:39, Maxime Boissonneault wrote: Hi, I compiled Charm++ 6.6.0rc3 using ./build charm++ mpi-linux-x86_64 smp --with-production When compiling the simple example mpi-linux-x86_64-smp/tests/charm+