[OMPI users] SGE segfaulting with OpenMPI 1.8.6

m.delorme Thu, 23 Jul 2015 10:45:34 -0400 (EDT)

Hello,


I have been working on this problem for the last week, browsing the help and 
the mailing list with no success.

While trying to run MPI programs using SGE, I end up with seg faults every time.


Some information about the system:


I am working on a 14-node cluster. Every node is an Intel Xeon machine with 2 
sockets of 10 cores each (so 20 cores per node). The nodes are connected via 
InfiniBand. The job scheduler is Grid Engine, as mentioned before.

Since I have no control over the cluster administration, and the "default" 
installation of Open MPI is an old one, I compiled and installed Open MPI 
1.8.6 myself and prepended its paths (PATH and LD_LIBRARY_PATH) to ensure that 
my version of MPI is used.
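
To make "prepended paths" concrete, here is a minimal sketch of the kind of setup I mean. The install prefix is a hypothetical placeholder (the real location is not given in this message), and OMPI_PREFIX is just a variable name chosen for illustration.

```shell
# Hypothetical install prefix: substitute the directory where
# Open MPI 1.8.6 was actually installed.
OMPI_PREFIX="${OMPI_PREFIX:-$HOME/opt/openmpi-1.8.6}"

# Prepend, so this build is found before the system default.
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Show which binaries the shell resolves after the change.
command -v mpirun || echo "mpirun not found on PATH"
command -v orted  || echo "orted not found on PATH"
```

Note that this only affects the shell in which it is executed; whether the daemons started on the compute nodes see the same values is a separate question.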


Open MPI has been configured with the --with-sge flag, and grepping for 
gridengine in the ompi_info output returns something that looks correct:


MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)



Now, when running a simple script that displays the hostname, on two slots 
bound to a single node, I get the following message:


[galaxy1:44361] Error: unknown option "--hnp-topo-sig"
Segmentation fault
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------


When I log in to the specific host that crashes and run the program by hand 
with mpirun, everything executes without problem.
I made sure that the libraries and paths are right, that I have the necessary 
rights on the node, and that /tmp is accessible. I don't think the fourth point 
of the list is the problem. As for the last one, I suppose that if I can reach 
the node via ssh, SGE shouldn't have a problem with it either.

My guess is therefore that the problem lies in SGE or in the integration of 
Open MPI with SGE.

I searched for "hnp-topo-sig" without much success, and only found a 
Stack Overflow page indicating that the cause should be nodes running a 
different version of Open MPI.
I know that there is an old Open MPI version installed by default on the nodes, 
but shouldn't prepending the paths and exporting the environment (using the -V 
flag in the script) be sufficient to ensure that the right version of Open MPI 
is used?
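
One way to test that assumption would be a small job script that reports which Open MPI binaries the batch environment actually resolves. The following is only a sketch: the job name ompi_check and the function name report_ompi_env are made up for illustration, and the script only inspects the environment of the job shell itself, not that of daemons started on other nodes.

```shell
#!/bin/bash
#$ -N ompi_check
#$ -pe orte 2
#$ -V
#$ -cwd

# Print where each Open MPI tool resolves from in this environment.
report_ompi_env() {
    local tool path
    for tool in mpirun orted ompi_info; do
        if path=$(command -v "$tool"); then
            echo "$tool -> $path"
        else
            echo "$tool -> not found"
        fi
    done
    echo "PATH=$PATH"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
}

report_ompi_env
```

If mpirun and orted resolve to the old default installation here, the "different version" explanation would fit the unknown-option error.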

Some additional information:

qconf -sp orte:

pe_name            orte
slots              2000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
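
As far as I understand the Open MPI documentation on SGE, the two settings that matter for the tight integration are control_slaves TRUE and job_is_first_task FALSE. A small check of a listing like the one above could be sketched as follows; the function name check_pe is made up for illustration.

```shell
# Read a parallel-environment listing on stdin and report whether
# control_slaves is TRUE and job_is_first_task is FALSE.
check_pe() {
    awk '
        $1 == "control_slaves"    { cs = $2 }
        $1 == "job_is_first_task" { jf = $2 }
        END {
            ok = (cs == "TRUE" && jf == "FALSE")
            printf "control_slaves=%s job_is_first_task=%s -> %s\n", \
                   cs, jf, (ok ? "ok" : "check")
            exit (ok ? 0 : 1)
        }'
}

# Usage on a cluster with Grid Engine installed:
#   qconf -sp orte | check_pe
```

With the values listed above this check passes, which is one reason I do not suspect the PE definition itself.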


You will find attached the compressed output of ompi_info -a --parsable.



Thank you very much in advance for any suggestion,


MD
