Hello,
I have been working on this problem for the past week, searching the documentation and the mailing list archives without success. Whenever I try to run MPI programs through SGE, I get a segmentation fault.

Some information on the system: I am working on a 14-node cluster. Each node has two Intel Xeon sockets with 10 cores each (so 20 cores per node), and the nodes are connected by InfiniBand. The job scheduler is Grid Engine, as mentioned above.

Since I do not administer the cluster, and the default Open MPI installation is an old one, I compiled and installed Open MPI 1.8.6 myself and prepended its directories to PATH and LD_LIBRARY_PATH to make sure my version is the one used.

Open MPI was configured with the --with-sge flag, and grepping for grid engine in the ompi_info output returns something that looks correct:

MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)

Now, when I run a simple script that displays the hostname, on two slots bound to a single node, I get the following message:

[galaxy1:44361] Error: unknown option "--hnp-topo-sig"
Segmentation fault
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

When I log in to the specific host that crashes and run the program by hand with mpirun, everything executes without a problem. I made sure that the libraries and paths are right, that I have permissions on the node, and that /tmp is accessible. I do not think the fourth point in the list applies. As for the last one, I suppose that if I can reach the node over ssh, SGE should not have a problem with it either.

My guess is therefore that the problem comes from SGE, or from the integration of Open MPI with SGE.

I googled "hnp-topo-sig" with no real success, and only found a Stack Overflow page suggesting that the cause is nodes running a different version of Open MPI. I know there is an old Open MPI version installed by default on the nodes, but shouldn't prepending the paths and exporting the environment (using the -V flag in the script) be sufficient to ensure that the right version of Open MPI is used?

Some additional information, the output of qconf -sp orte:

pe_name            orte
slots              2000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

You will find attached the compressed log of ompi_info -a --parsable.

Thank you very much in advance for any suggestions,

MD
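
P.S. In case it helps anyone reproduce the setup, here is a minimal sketch of a job script for checking which Open MPI binaries the job environment resolves. It is not the exact script used above: the job name, the report helper and the commented-out mpirun line are illustrative assumptions. Only the orte PE, the two slots and the -V flag come from the description above.

```shell
#!/bin/sh
# Hypothetical SGE job script: the job name and the helper below are
# illustrative assumptions, not taken from the actual setup.
#$ -N mpi_path_check
#$ -pe orte 2
#$ -cwd
#$ -V

# Print the first match on PATH for a command, or a marker if it is missing.
report() {
    path=$(command -v "$1" 2>/dev/null) || path="(not found)"
    printf '%s -> %s\n' "$1" "$path"
}

echo "host: $(hostname)"
echo "NSLOTS: ${NSLOTS:-unset}"            # set by Grid Engine inside a PE job
echo "PATH: ${PATH}"
echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-unset}"

# mpirun launches orted on the node, so both should resolve to the same install.
report mpirun
report orted

# The actual test command would go here, for example:
# mpirun -np "$NSLOTS" hostname
```

If mpirun and orted resolve to different installation prefixes, or orted resolves to the old default installation, that would be consistent with the version mismatch suggested on Stack Overflow.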