Hi,

Thanks for the quick answer.

I am actually using the module environment, and I made my own module for
openmpi-1.8.6 that prepends the paths.


I was so desperate to get the environment right that I doubled everything: the
job is submitted with the -V flag, the script loads the module, and it prints
the environment. This shows the correct PATH and LD_LIBRARY_PATH.
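
For reference, the relevant part of my submission script looks roughly like
this (module name as in my setup, PE as in the config quoted below):

#!/bin/bash
#$ -V                                      # export the submission environment
#$ -pe orte 2
#$ -cwd
module load openmpi-1.8.6                  # my module, prepends PATH and LD_LIBRARY_PATH
env | grep -E '^(PATH|LD_LIBRARY_PATH)='   # both look correct in the job output
mpirun -np $NSLOTS hostname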

The problem is that printing the environment before mpirun only shows me the
environment of the master node running mpirun, not of the nodes where the
program will actually be executed.

On the other hand, if I try to run env through mpirun itself, the whole thing
segfaults as before.


So I am not sure I have a proper way to ensure my environment variables are
right on the remote nodes.
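
One thing I can still check is what a non-interactive remote shell sees, since
that is closer to how the remote daemons are started; something like this
(node name hypothetical):

ssh node01 'echo $PATH; echo $LD_LIBRARY_PATH; which orted'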


MD

________________________________
From: users <users-boun...@open-mpi.org> on behalf of John Hearns 
<hear...@googlemail.com>
Sent: Thursday, July 23, 2015 3:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] SGE segfaulting with OpenMPI 1.8.6

You say that you can run the code OK 'by hand' with an mpirun.

Are you assuming somehow that the Gridengine jobs will inherit your environment 
variables, paths etc?
If I remember correctly, you should submit with the  -V  option to pass over 
environment settings.
Even better, make sure that the job script itself sets all the paths and 
variables.
Have you looked at using the 'modules' environment?

Also submit a job script and put the 'env' command in as the first command.
Look at your output closely.
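
A minimal sketch of what I mean, with placeholder paths:

#!/bin/bash
#$ -cwd
env                      # first command: inspect this closely in the job output
# set the paths explicitly rather than relying on inheritance
export PATH=/path/to/openmpi-1.8.6/bin:$PATH
export LD_LIBRARY_PATH=/path/to/openmpi-1.8.6/lib:$LD_LIBRARY_PATH
mpirun hostname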




On 23 July 2015 at 15:45, <m.delo...@surrey.ac.uk> wrote:

Hello,


I have been working on this problem for the last week, browsing the help and 
the mailing list with no success.

While trying to run MPI programs using SGE, I end up with seg faults every time.


A bit of information on the system:


I am working on a 14-node cluster. Every node has two Intel Xeon sockets with 
10 cores each (so 20 cores per node). The nodes are connected via InfiniBand. 
The job scheduler is Grid Engine, as mentioned before.

Since I don't have control over the cluster administration, and the "default" 
installation of Open MPI is an old one, I compiled and installed Open MPI 1.8.6 
myself and prepended the paths (bin and library) to ensure my version of MPI 
is used.


Open MPI was configured with the --with-sge flag, and grepping for grid engine 
in ompi_info returns something that looks correct:


MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)
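
For completeness, the build and environment setup were along these lines
(install prefix hypothetical):

./configure --prefix=$HOME/openmpi-1.8.6 --with-sge
make all install
# prepended by my module:
export PATH=$HOME/openmpi-1.8.6/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-1.8.6/lib:$LD_LIBRARY_PATH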



Now, when running a simple script that displays the hostname, on two slots 
bound to a single node, I get the following message:


[galaxy1:44361] Error: unknown option "--hnp-topo-sig"

Segmentation fault

--------------------------------------------------------------------------

ORTE was unable to reliably start one or more daemons.

This usually is caused by:


* not finding the required libraries and/or binaries on

  one or more nodes. Please check your PATH and LD_LIBRARY_PATH

  settings, or configure OMPI with --enable-orterun-prefix-by-default


* lack of authority to execute on one or more specified nodes.

  Please verify your allocation and authorities.


* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).

  Please check with your sys admin to determine the correct location to use.


*  compilation of the orted with dynamic libraries when static are required

  (e.g., on Cray). Please check your configure cmd line and consider using

  one of the contrib/platform definitions for your system type.


* an inability to create a connection back to mpirun due to a

  lack of common network interfaces and/or no route found between

  them. Please check network connectivity (including firewalls

  and network routing requirements).

--------------------------------------------------------------------------


When I connect to the specific host that crashes and run the program by hand 
with mpirun, everything executes without problem.
I made sure the libraries and paths are right, that I have the necessary 
permissions on the node, and that /tmp is writable. I don't think the fourth 
point of the list is the problem, and as for the last one, I suppose that if I 
can reach the node over ssh, SGE shouldn't have a problem with it either ...

My guess is therefore a problem with SGE, or with the integration of Open MPI 
and SGE ...

I googled "hnp-topo-sig" with no real success, and only found a Stack Overflow 
page indicating that the problem is typically nodes running a different 
version of Open MPI.
I know that an old Open MPI version is installed by default on the nodes, but 
shouldn't prepending the paths and exporting the environment (using the -V 
flag in the script) be sufficient to ensure the right version of Open MPI is 
used?
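
From what I read, the alternative would be to force the remote daemons to use
my install, either by reconfiguring with --enable-orterun-prefix-by-default
(as the error message above suggests) or by passing the prefix explicitly,
something like this (prefix hypothetical):

mpirun --prefix $HOME/openmpi-1.8.6 -np 2 hostname

Is that the recommended approach here?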

A bit of additional information:

qconf -sp orte:

pe_name            orte
slots              2000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
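
The job is submitted against this PE, roughly (script name hypothetical):

qsub -pe orte 2 myjob.sh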


You will find attached the compressed output of ompi_info -a --parsable.



Thank you very much in advance for any suggestion,


MD



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/07/27312.php
