You say that you can run the code OK 'by hand' with an mpirun. Are you assuming that the Gridengine jobs will somehow inherit your environment variables, paths, etc.? If I remember correctly, you should submit with the -V option to pass over the environment settings. Even better, make sure that the job script itself sets all the paths and variables. Have you looked at using the 'modules' environment? Also, submit a job script with the 'env' command as its first command and look closely at the output.
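For example, a minimal job script along these lines makes the environment explicit and shows what the job actually sees; the install prefix below is an assumption (point it at your own 1.8.6 build), and the PE name matches the 'orte' PE you quote further down:

#!/bin/bash
#$ -S /bin/bash
#$ -N mpi_env_test
#$ -pe orte 2
#$ -cwd
#$ -j y

# First thing: dump the environment the job really runs with, so it can
# be compared against an interactive shell on the same node.
env

# Set the paths explicitly in the script rather than relying on -V.
# The prefix is an assumption -- adjust to wherever your 1.8.6 lives.
export PATH=$HOME/openmpi-1.8.6/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-1.8.6/lib:$LD_LIBRARY_PATH

# Or, if the site provides environment modules:
# module load openmpi/1.8.6

# Call mpirun by full path so there is no ambiguity about which one runs.
$HOME/openmpi-1.8.6/bin/mpirun hostname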
On 23 July 2015 at 15:45, <m.delo...@surrey.ac.uk> wrote:

> Hello,
>
> I have been working on this problem for the last week, browsing the help
> pages and the mailing list with no success. While trying to run MPI
> programs through SGE, I end up with seg faults every time.
>
> A bit of information on the system: I am working on a 14-node cluster.
> Each node has two Intel Xeon sockets with 10 cores each (so 20 cores per
> node), and the nodes are connected over InfiniBand. The job scheduler is
> Grid Engine, as mentioned before.
>
> Since I don't have admin rights on the cluster, and the "default"
> installation of Open MPI is an old one, I compiled and installed
> Open MPI 1.8.6 myself and prepended the paths (binary and library) to
> ensure my version of MPI is used.
>
> Open MPI has been configured with --with-sge, and grepping for grid
> engine in ompi_info returns something that looks correct:
>
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)
>
> Now, when running a simple script that displays the hostname on two
> slots bound to a single node, I get the following message:
>
> [galaxy1:44361] Error: unknown option "--hnp-topo-sig"
> Segmentation fault
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
> When I connect to the specific host that crashes and run the program by
> hand with mpirun, the whole thing executes without problem. I made sure
> the libraries and paths are right, that I have the necessary rights on
> the node, and that /tmp is accessible. I don't think the fourth point of
> the list is the problem; as for the last one, I suppose that if I can
> ssh to the node, SGE shouldn't have a problem with it either.
>
> My guess is therefore a problem with SGE or with the Open MPI/SGE
> integration.
>
> I googled "hnp-topo-sig" with no real success, and only got to a Stack
> Overflow page indicating that the problem would be nodes running a
> different version of Open MPI. I know that there is an old Open MPI
> version by default on the nodes, but shouldn't prepending the paths and
> exporting the environment (using the -V flag in the script) be
> sufficient to ensure the right version of Open MPI is used?
>
> A bit of additional information, qconf -se orte:
>
> pe_name            orte
> slots              2000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
> You will find attached the compressed log of ompi_info -a --parsable.
>
> Thank you very much in advance for any suggestion,
>
> MD
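For what it's worth, the 'unknown option "--hnp-topo-sig"' error is what you typically see when a 1.8.x mpirun starts an older orted on the remote end, which fits the Stack Overflow hint you mention. If the env output from the script above looks right, a quick cross-check is to launch with an explicit prefix, so the remote daemons are taken from your own install regardless of what the remote shells put first in PATH. The install path here is again an assumption:

# Inside the job script, after the environment has been printed:
which mpirun orted
$HOME/openmpi-1.8.6/bin/mpirun --prefix $HOME/openmpi-1.8.6 hostname

If that runs while the plain mpirun call does not, the old system Open MPI is what SGE is finding when it spawns the daemons.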