I believe qrsh will execute a simple env command across the allocated nodes - 
have you looked into that?
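
Something along these lines from inside the job script would do it -- a sketch only, 
assuming the allocated hosts are listed in $PE_HOSTFILE and that your PE allows 
qrsh -inherit (your control_slaves is TRUE, so it should):

    # print the environment each allocated host actually sees
    for host in $(awk '{print $1}' $PE_HOSTFILE); do
        echo "=== $host ==="
        qrsh -inherit $host env | egrep 'PATH|LD_LIBRARY_PATH'
    done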

The bottom line is that you simply are not getting the right orted on the 
remote nodes - you are getting the old one, which doesn't recognize the new 
command-line option that mpirun is passing it.

Try adding --prefix=<install-point> to your mpirun cmd line. This will force 
PATH and LD_LIBRARY_PATH to the correct values when executing the orted.
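
For example (the install path and program name here are just placeholders):

    mpirun --prefix /opt/openmpi-1.8.6 -np 2 ./my_program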

Also, you should probably add --enable-orterun-prefix-by-default to your 
configure line to avoid having to add anything to the mpirun cmd line.
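
When you rebuild, the configure line would look something like this (the prefix is a 
placeholder for your install location):

    ./configure --prefix=/opt/openmpi-1.8.6 --with-sge --enable-orterun-prefix-by-default
    make -j4 all install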


> On Jul 23, 2015, at 8:08 AM, m.delo...@surrey.ac.uk wrote:
> 
> hi, 
> 
> Thanks for the quick answer.
> I am actually using the modules environment, and made my own module for 
> openmpi-1.8.6 that prepends the paths.
> 
> I was so desperate to get the env right that I doubled up on everything: my script 
> runs with the -V flag, loads the modules, and prints the env, which shows the right 
> PATH and LD_LIBRARY_PATH.
> The problem is that printing the env before mpirun only shows me the environment of 
> the master node running mpirun, not that of the nodes where the program will really 
> be executed.
> On the other hand, if I try to run env through mpirun itself, the whole thing 
> segfaults as before.
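> 
> Roughly, the two checks were (two slots, as a sketch):
> 
>     env | egrep 'PATH|LD_LIBRARY_PATH'   # master node only -- looks correct
>     mpirun -np 2 env                     # dies with the same error as the real program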
> 
> So I am not sure I have a proper way to ensure my env variables are right.
> 
> MD
> 
> From: users <users-boun...@open-mpi.org> on behalf of John Hearns 
> <hear...@googlemail.com>
> Sent: Thursday, July 23, 2015 3:53 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] SGE segfaulting with OpenMPI 1.8.6
>  
> You say that you can run the code OK 'by hand' with an mpirun.
> 
> Are you assuming somehow that the Gridengine jobs will inherit your 
> environment variables, paths etc?
> If I remember correctly, you should submit with the -V option to pass over your 
> environment settings.
> Even better, make sure that the job script itself sets all the paths and 
> variables.
> Have you looked at using the 'modules' environment?
> 
> Also submit a job script and put the 'env' command in as the first command.
> Look at your output closely.
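> 
> As a sketch only -- the PE name, slot count, module name and program are placeholders 
> to adapt to your site:
> 
>     #!/bin/bash
>     #$ -cwd
>     #$ -V
>     #$ -pe orte 2
>     module load openmpi/1.8.6     # or set PATH/LD_LIBRARY_PATH explicitly here
>     env                           # first command: see what the job really inherits
>     mpirun -np $NSLOTS ./my_program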
> 
> 
> 
> 
> On 23 July 2015 at 15:45, <m.delo...@surrey.ac.uk> wrote:
> Hello, 
> 
> I have been working on this problem for the last week, browsing the documentation and 
> the mailing list archives with no success.
> Whenever I try to run MPI programs through SGE, I end up with segfaults.
> 
> A bit of information on the system:
> 
> I am working on a 14-node cluster. Every node is an Intel Xeon machine with 2 sockets 
> of 10 cores each (so 20 cores per node). The nodes are InfiniBand-connected, and the 
> job scheduler is Grid Engine, as mentioned before.
> Since I don't administer the cluster myself, and the "default" installation of Open MPI 
> is an old one, I compiled and installed Open MPI 1.8.6 myself and prepended its bin and 
> lib directories to my paths (PATH and LD_LIBRARY_PATH) to ensure my version of MPI is 
> used.
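> 
> Concretely, the prepend amounts to something like this (the install location is a 
> placeholder):
> 
>     export PATH=$HOME/openmpi-1.8.6/bin:$PATH
>     export LD_LIBRARY_PATH=$HOME/openmpi-1.8.6/lib:$LD_LIBRARY_PATH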
> 
> Open MPI has been configured with the --with-sge flag, and grepping for gridengine in 
> ompi_info returns something that looks correct:
> 
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)
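> 
> (For reference, that check was along the lines of:)
> 
>     ompi_info | grep gridengine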
> 
> 
> Now, when running a simple script that displays the hostname, on two slots bound to a 
> single node, I get the following message:
> 
> [galaxy1:44361] Error: unknown option "--hnp-topo-sig"
> Segmentation fault
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> 
> 
> When I connect to the specific host that crashes and run the program by hand with 
> mpirun, the whole thing executes without a problem.
> I made sure the libraries and paths are right, that I have the necessary permissions on 
> the node, and that /tmp is writable. I don't think the fourth point in the list is the 
> problem, and as for the last one, I suppose that if I can reach the node over ssh, SGE 
> shouldn't have any trouble connecting to it either...
> 
> My guess is therefore a problem with SGE, or with the integration of Open MPI and 
> SGE...
> 
> I googled "hnp-topo-sig" with no real success, and only found a Stack Overflow page 
> indicating that the problem is usually nodes running a different version of Open MPI.
> I know that there is an old Open MPI version installed by default on the nodes, but 
> shouldn't prepending the paths and exporting the environment (using the -V flag in the 
> script) be enough to ensure the right version of Open MPI is used?
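> 
> As a quick check of what a non-interactive shell on a compute node actually picks up, 
> something like this (the hostname is a placeholder):
> 
>     ssh node02 'which orted; echo $LD_LIBRARY_PATH'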
> 
> A bit of additional information, 
> 
> qconf -sp orte:
> 
> pe_name            orte
> slots              2000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
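> 
> The jobs request this PE with something along the lines of (the script name is a 
> placeholder):
> 
>     qsub -pe orte 2 my_job.sh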
> 
> 
> You will find attached the compressed output of ompi_info -a --parsable.
> 
> 
> 
> Thank you very much in advance for any suggestion, 
> 
> 
> MD
> 
> 
> 
> 
