I believe qrsh will execute a simple env command across the allocated nodes - have you looked into that?
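For example, something along these lines should show what PATH and LD_LIBRARY_PATH the job actually sees on the execution side (the "orte" PE name and the slot count are just taken from the qconf output further down the thread):

    qrsh -V -pe orte 2 env | grep -E '^(PATH|LD_LIBRARY_PATH)='

If the old Open MPI install shows up first there, that is where the stale orted is coming from.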
The bottom line is that you simply are not getting the right orted on the remote nodes - you are getting the old one, which doesn't recognize the new command line option that mpirun is giving it.

Try adding --prefix=<install-point> to your mpirun cmd line. This will force PATH and LD_LIBRARY_PATH to the correct values when executing the orted.

Also, you should probably add --enable-orterun-prefix-by-default to your configure line to avoid having to add anything to the mpirun cmd line.
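Roughly (the install path below is only a placeholder for wherever your 1.8.6 build actually lives):

    mpirun --prefix=/path/to/openmpi-1.8.6 -np 2 hostname

and, if you rebuild:

    ./configure --prefix=/path/to/openmpi-1.8.6 --with-sge --enable-orterun-prefix-by-default
    make all install
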
> On Jul 23, 2015, at 8:08 AM, m.delo...@surrey.ac.uk wrote:
>
> hi,
>
> Thanks for the quick answer.
> I am actually using the module environment, and made my own module for
> openmpi-1.8.6 prepending the paths.
>
> I was so desperate to get the env right that I doubled everything: my script
> is running with the -V flag, I am loading the modules, and printing the env.
> This returns the right PATH and LD_LIBRARY_PATH.
> The problem is that printing the env before mpirun gives me the environment
> of the master node running mpirun, but not of the nodes where the program
> will really be executed.
> On the other hand, if I just try to run env through mpirun, then the whole
> thing segfaults as before.
>
> So I am not sure I have a proper way to ensure my env variables are right.
>
> MD
>
> From: users <users-boun...@open-mpi.org> on behalf of John Hearns
> <hear...@googlemail.com>
> Sent: Thursday, July 23, 2015 3:53 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] SGE segfaulting with OpenMPI 1.8.6
>
> You say that you can run the code OK 'by hand' with an mpirun.
>
> Are you assuming somehow that the Gridengine jobs will inherit your
> environment variables, paths etc?
> If I remember correctly, you should submit with the -V option to pass over
> environment settings.
> Even better, make sure that the job script itself sets all the paths and
> variables.
> Have you looked at using the 'modules' environment?
>
> Also submit a job script and put the 'env' command in as the first command.
> Look at your output closely.
>
>
> On 23 July 2015 at 15:45, <m.delo...@surrey.ac.uk> wrote:
> Hello,
>
> I have been working on this problem for the last week, browsing the help and
> the mailing list with no success.
> While trying to run MPI programs using SGE, I end up with seg faults every
> time.
>
> A bit of information on the system:
>
> I am working on a 14-node cluster. Every node is an Intel Xeon machine with
> 2 sockets of 10 cores each (so 20 cores per node). The nodes are Infiniband
> connected. The job scheduler is Grid Engine, as mentioned before.
> Since I don't have admin rights on the cluster, and the "default"
> installation of Open MPI is an old one, I compiled and installed
> Open MPI 1.8.6 myself and prepended the paths (general and library) to
> ensure my version of MPI is used.
>
> Open MPI has been configured with the flag --with-sge, and grepping grid
> engine in ompi_info returns something that looks correct:
>
>             MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)
>
>
> Now, when running a simple script that displays the hostname, on two slots
> bound to a single node, I get the following message:
>
> [galaxy1:44361] Error: unknown option "--hnp-topo-sig"
> Segmentation fault
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
>
> When I connect to the crashing host and run the program by hand with mpirun,
> the whole thing executes without problem.
> I made sure the libraries and paths are right, that I have the necessary
> permissions on the node, and that /tmp is accessible. I don't think the
> fourth point of the list is the problem; as for the last one, I suppose that
> if I can access the node over ssh, SGE shouldn't have a problem with it
> either...
>
> My guess is then a problem with SGE, or with the integration of Open MPI
> and SGE...
>
> I googled "hnp-topo-sig" with no real success, and only found a
> Stack Overflow page indicating that the problem could be nodes running a
> different version of Open MPI.
> I know that there is an old Open MPI version by default on the nodes, but
> shouldn't prepending the paths and exporting the environment (using the -V
> flag in the script) be sufficient to ensure the right version of Open MPI
> is used?
>
> A bit of additional information,
>
> qconf -sp orte:
>
> pe_name              orte
> slots                2000
> user_lists           NONE
> xuser_lists          NONE
> start_proc_args      /bin/true
> stop_proc_args       /bin/true
> allocation_rule      $fill_up
> control_slaves       TRUE
> job_is_first_task    FALSE
> urgency_slots        min
> accounting_summary   FALSE
> qsort_args           NONE
>
>
> You will find attached the compressed log of ompi_info -a --parsable.
>
>
> Thank you very much in advance for any suggestion,
>
>
> MD
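For reference, pulling the suggestions in this thread together, a minimal submission script for this setup could look something like the sketch below (the module name is the one mentioned above, and the install path is a placeholder for wherever 1.8.6 was installed):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -V
    #$ -pe orte 2

    module load openmpi-1.8.6
    # print the environment as seen by the job script itself (John's suggestion)
    env | grep -E '^(PATH|LD_LIBRARY_PATH)='
    # --prefix forces the matching orted on the remote side (Ralph's suggestion)
    mpirun --prefix=/path/to/openmpi-1.8.6 -np $NSLOTS hostname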