Just out of curiosity: I noticed that one of the segv stack traces involved 
the CUDA stack.

Can you try a build without CUDA and see if that resolves the problem?
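
In case it helps, here is a minimal sketch of such a build (the prefix path is 
just a placeholder, and this assumes your existing GCC 4.8 toolchain):

    # Configure Open MPI with CUDA support explicitly disabled
    ./configure --prefix=/path/to/openmpi-1.8.2rc4-nocuda \
        CC=gcc CXX=g++ FC=gfortran \
        --without-cuda
    make -j8 all
    make install

That keeps the CUDA-aware code paths out of the build entirely, so they cannot 
be involved in the crash.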



On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> Hi Jeff,
> 
> On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
>> On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault 
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>> 
>>> Correct.
>>> 
>>> Could it be because Torque (pbs_mom) is not running on the head node and 
>>> mpiexec attempts to contact it?
>> Not for Open MPI's mpiexec, no.
>> 
>> Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use 
>> TM stuff (i.e., Torque stuff) if it sees the environment variable markers 
>> indicating that it's inside a Torque job.  If not, it just uses rsh/ssh (or 
>> localhost launch in your case, since you didn't specify any hosts).
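>> 
>> A quick sanity check along those lines (PBS_JOBID and PBS_ENVIRONMENT are 
>> the usual Torque markers; plm is the launcher framework):
>> 
>>     # Confirm that no Torque environment markers are set on the login node
>>     env | grep '^PBS_' || echo "no Torque environment detected"
>> 
>>     # Or force the rsh/ssh launcher explicitly, bypassing TM detection
>>     mpirun --mca plm rsh -np 4 hostname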
>> 
>> If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI 
>> "hostname" command from Linux), then something is seriously borked with your 
>> Open MPI installation.
> mpirun -np 4 hostname works fine:
> [mboisson@helios-login1 ~]$ which mpirun
> /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
> [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
> helios-login1
> helios-login1
> helios-login1
> helios-login1
> 0
> 
>> 
>> Try running with:
>> 
>>     mpirun -np 4 --mca plm_base_verbose 10 hostname
>> 
>> This should show the steps OMPI is trying to take to launch the 4 copies of 
>> "hostname" and potentially give some insight into where it's hanging.
>> 
>> Also, just to make sure, you have ensured that you're compiling everything 
>> with a single compiler toolchain, and the support libraries from that 
>> specific compiler toolchain are available on any server on which you're 
>> running (to include the head node and compute nodes), right?
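>> 
>> For example, comparing this output on the head node and on a compute node 
>> is a quick way to verify that (mpicc --showme prints the underlying 
>> compiler and flags that the wrapper uses):
>> 
>>     gcc --version
>>     mpicc --showme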
> Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6, with 
> the same results). Almost all of the software (compilers, toolchain, etc.) 
> is installed on Lustre, built from source, and is identical on both the 
> login (head) node and the compute nodes.
> 
> The few differences between the head node and the compute nodes:
> 1) The compute nodes run from RAMFS; the login node is installed on disk.
> 2) The compute nodes and the login node have different hardware 
> configurations (the compute nodes have GPUs, the head node does not).
> 3) The login node has MORE CentOS 6 packages than the compute nodes (such 
> as the -devel packages, some fonts/X11 libraries, etc.), but all the 
> packages that are on the compute nodes are also on the login node.
> 
>> 
>> And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right 
>> places -- i.e., to the Open MPI installation that you expect it to point to. 
>>  E.g., if you "ldd ring_c", it shows the libmpi.so that you expect.  And 
>> "which mpiexec" shows the mpirun that you expect.  Etc.
> As per the content of "env.out" in the archive, yes. They point to the 
> OMPI 1.8.2rc4 installation directories, on Lustre, and are shared between 
> the head node and the compute nodes.
> 
> 
> Maxime
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25043.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
