Hi,
Thanks a lot, it seems to be working using --prefix on mpirun. I have trouble understanding why using the right flags, or exporting "by hand" with -x PATH -x LD_LIBRARY_PATH, does not work. In any case, the --prefix option works, so that's a good starting point.

Thank you again,
MD

________________________________
From: users <users-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
Sent: Friday, July 24, 2015 1:51 AM
To: Open MPI Users
Subject: Re: [OMPI users] SGE segfaulting with OpenMPI 1.8.6

I believe qrsh will execute a simple env command across the allocated nodes - have you looked into that?

The bottom line is that you simply are not getting the right orted on the remote nodes - you are getting the old one, which doesn't recognize the new command line option that mpirun is giving it. Try adding --prefix=<install-point> to your mpirun command line. This will force PATH and LD_LIBRARY_PATH to the correct values when executing the orted.

Also, you should probably add --enable-orterun-prefix-by-default to your configure line to avoid having to add anything to the mpirun command line.

On Jul 23, 2015, at 8:08 AM, m.delo...@surrey.ac.uk wrote:

Hi,

Thanks for the quick answer. I am actually using the module environment, and made my own module for openmpi-1.8.6 that prepends the paths. I was so desperate to get the env right that I doubled everything: my script runs with the -V flag, loads the modules, and prints the env. This returns the right PATH and LD_LIBRARY_PATH.

The problem is that printing the env before mpirun gives me the environment of the master node running mpirun, but not of the nodes where the program will really be executed. On the other hand, if I try to run env through mpirun itself, the whole thing segfaults as before. So I am not sure I have a proper way to check that my environment variables are right on the compute nodes.

MD

________________________________
From: users <users-boun...@open-mpi.org> on behalf of John Hearns <hear...@googlemail.com>
Sent: Thursday, July 23, 2015 3:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] SGE segfaulting with OpenMPI 1.8.6

You say that you can run the code OK 'by hand' with an mpirun. Are you somehow assuming that the Grid Engine jobs will inherit your environment variables, paths, etc.? If I remember correctly, you should submit with the -V option to pass over environment settings. Even better, make sure that the job script itself sets all the paths and variables. Have you looked at using the 'modules' environment?

Also submit a job script and put the 'env' command in as the first command, then look at the output closely.

On 23 July 2015 at 15:45, <m.delo...@surrey.ac.uk> wrote:

Hello,

I have been working on this problem for the last week, browsing the help pages and the mailing list with no success. While trying to run MPI programs through SGE, I end up with segfaults every time.

A bit of information on the system: I am working on a 14-node cluster. Every node is an Intel Xeon machine with 2 sockets of 10 cores each (so 20 cores per node). The nodes are InfiniBand-connected. The job scheduler is Grid Engine, as mentioned before. Since I don't administer the cluster myself, and the "default" installation of Open MPI is an old one, I compiled and installed Open MPI 1.8.6 myself and prepended the paths (both PATH and LD_LIBRARY_PATH) to ensure my version of MPI is used.
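By "prepending" I mean roughly the following, both in the module file and in the job environment (the variable name and install prefix here are placeholders, not my exact paths):

    export MPI_HOME=$HOME/opt/openmpi-1.8.6               # placeholder install prefix
    export PATH=$MPI_HOME/bin:$PATH                       # put my mpirun/orted first
    export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH # and my Open MPI libraries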
Open MPI was configured with the --with-sge flag, and grepping for grid engine in ompi_info returns something that looks correct:

    MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.6)

Now, when running a simple script that displays the hostname on two slots bound to a single node, I get the following message:

[galaxy1:44361] Error: unknown option "--hnp-topo-sig"
Segmentation fault
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

When I connect to the crashing host and run the program by hand with mpirun, everything executes without problem. I made sure the libraries and paths are right, that I have the necessary permissions on the node, and that /tmp is accessible. I don't think the fourth point in the list is the problem, and as for the last one, I suppose that if I can reach the node over ssh, SGE shouldn't have any problem with it either. My guess is therefore a problem with SGE, or with the integration of Open MPI and SGE.

I googled "hnp-topo-sig" with no real success, and only found a Stack Overflow page indicating that the problem is usually nodes running different versions of Open MPI. I know there is an old Open MPI version installed by default on the nodes, but shouldn't prepending the paths and exporting the environment (using the -V flag in the script) be sufficient to ensure the right version of Open MPI is used?

A bit of additional information, qconf -sp orte:

pe_name            orte
slots              2000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

You will find attached the compressed log of ompi_info -a --parsable.

Thank you very much in advance for any suggestion,
MD
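P.S. For completeness, the submission script is roughly of the following shape (the module name and the test binary are placeholders, not my exact setup):

    #!/bin/bash
    #$ -V                                 # export the submission environment to the job
    #$ -pe orte 2                         # request two slots in the "orte" PE shown above
    #$ -cwd
    module load openmpi/1.8.6             # placeholder name for my own 1.8.6 module
    env | grep -E 'PATH|LD_LIBRARY_PATH'  # only shows the env on the node running mpirun
    mpirun -np $NSLOTS ./print_hostname   # placeholder for the simple hostname-printing program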