Good morning Vipul. I would like to ask some higher-level questions regarding your HPC cluster. Who are the manufacturers of the cluster nodes? How many compute nodes? What network interconnect do you have - Gigabit Ethernet, 10 Gigabit Ethernet, InfiniBand, Omni-Path? Which cluster middleware - OpenHPC? Rocks? Bright? Qlustar? Which version of Grid Engine? There have been MANY versions of it over the years. Who installed the cluster?
And now the big question - and everyone on the list will laugh at me for this... Would you consider switching to the Slurm batch queuing system?

On Sat, 30 May 2020 at 00:41, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:

> Hi,
>
> I need to launch my Open MPI application on grid. My application is
> designed to run N processes, where each process would have M threads. I am
> using Open MPI version 4.0.1.
>
> % /build/openmpi/openmpi-4.0.1/rhel6/bin/ompi_info | grep grid
>                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v4.0.1)
>
> To run it without grid, I run it as (say N = 7, M = 2):
> % mpirun -np 7 <application with arguments>
>
> The above works well and runs N processes. Based on some earlier advice on
> this forum, I have set up the grid submission using a grid job submission
> script that modifies the grid slot allocation, so that mpirun launches only
> 1 application process copy on each host allocated by grid. I have had some
> partial success. I think grid is able to start the job and then mpirun also
> starts to run, but then it errors out with the errors mentioned below.
> Strangely, after giving the message for having started all the daemons, it
> reports that it was not able to start one or more daemons.
>
> I have set up a grid submission script that modifies the pe_hostfile, and
> it appears that mpirun is able to take it and then use the host information
> to start launching the jobs. However, mpirun halts before it can start all
> the child processes. I enabled some debug logs but am not able to figure
> out a possible cause.
>
> Could somebody look at this and guide me on how to resolve this issue?
>
> I have pasted the detailed log as well as my job submission script below.
>
> As a clarification, when I run mpirun without grid, it (mpirun and my
> application) works on the same set of hosts without any problems.
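For anyone following the thread, the slot-halving trick described above can be sketched on a fabricated pe_hostfile (the hostnames and queue names below are invented; a real pe_hostfile has one line per host in the form `<host> <slots> <queue> <processor-range>`):

```shell
#!/bin/sh
# Sketch of the pe_hostfile rewrite described above, on fabricated data.
cat > /tmp/pe_hostfile.demo <<'EOF'
nodeA.example.com 2 all.q@nodeA.example.com <NULL>
nodeB.example.com 2 all.q@nodeB.example.com <NULL>
EOF

# Halve the slot count (field 2) on every line, as in the job script.
awk '{$2 = $2 / 2; print}' /tmp/pe_hostfile.demo
# -> nodeA.example.com 1 all.q@nodeA.example.com <NULL>
# -> nodeB.example.com 1 all.q@nodeB.example.com <NULL>
```

With each host reduced to 1 slot, mpirun reads the rewritten file (via the exported PE_HOSTFILE) and starts only one process per host, leaving the other core free for the second application thread.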
>
> Thanks,
> Vipul
>
> Job submission script:
>
> #!/bin/sh
> #$ -N velsyn
> #$ -pe orte2 14
> #$ -V -cwd -j y
> #$ -o out.txt
> #
> echo "Got $NSLOTS slots."
> echo "tmpdir is $TMPDIR"
> echo "pe_hostfile is $PE_HOSTFILE"
>
> cat $PE_HOSTFILE
> newhostfile=/testdir/tmp/pe_hostfile
>
> awk '{$2 = $2/2; print}' $PE_HOSTFILE > $newhostfile
>
> export PE_HOSTFILE=$newhostfile
> export LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib
>
> mpirun --merge-stderr-to-stdout --output-filename ./output:nojobid,nocopy
> --mca routed direct --mca orte_base_help_aggregate 0 --mca plm_base_verbose
> 1 --bind-to none --report-bindings -np 7 <application with args>
>
> The out.txt content is:
>
> Got 14 slots.
> tmpdir is /tmp/182117160.1.all.q
> pe_hostfile is /var/spool/sge/bos2/active_jobs/182117160.1/pe_hostfile
> bos2.wv.org.com 2 al...@bos2.wv.org.com <NULL>
> art8.wv.org.com 2 al...@art8.wv.org.com <NULL>
> art10.wv.org.com 2 al...@art10.wv.org.com <NULL>
> hpb7.wv.org.com 2 al...@hpb7.wv.org.com <NULL>
> bos15.wv.org.com 2 al...@bos15.wv.org.com <NULL>
> bos1.wv.org.com 2 al...@bos1.wv.org.com <NULL>
> hpb11.wv.org.com 2 al...@hpb11.wv.org.com <NULL>
> [bos2:22657] [[8251,0],0] plm:rsh: using "/wv/grid2/sge/bin/lx-amd64/qrsh
> -inherit -nostdin -V -verbose" for launching
> [bos2:22657] [[8251,0],0] plm:rsh: final template argv:
> /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template>
> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ;
> if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
> if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ;
> if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ;
> if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
> if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ;
> if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;
> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca orte_report_bindings "1"
> -mca ess "env" -mca ess_base_jobid "540737536" -mca ess_base_vpid "<template>"
> -mca ess_base_num_procs "7"
> -mca orte_node_regex "bos[1:2],art[1:8],art[2:10],hpb[1:7],bos[2:15],bos[1:1],hpb[2:11]@0(7)"
> -mca orte_hnp_uri "540737536.0;tcp://147.34.116.60:50769" --mca routed "direct"
> --mca orte_base_help_aggregate "0" --mca plm_base_verbose "1" -mca plm "rsh"
> --tree-spawn -mca orte_parent_uri "540737536.0;tcp://147.34.116.60:50769"
> -mca orte_output_filename "./output:nojobid,nocopy"
> -mca hwloc_base_binding_policy "none" -mca hwloc_base_report_bindings "1"
> -mca pmix "^s1,s2,cray,isolated"
> Starting server daemon at host "art10"
> Starting server daemon at host "art8"
> Starting server daemon at host "bos1"
> Starting server daemon at host "hpb7"
> Starting server daemon at host "hpb11"
> Starting server daemon at host "bos15"
> Server daemon successfully started with task id "1.art8"
> Server daemon successfully started with task id "1.bos1"
> Server daemon successfully started with task id "1.art10"
> Server daemon successfully started with task id "1.bos15"
> Server daemon successfully started with task id "1.hpb7"
> Server daemon successfully started with task id "1.hpb11"
> Unmatched ".
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:     bos2
>   target node: art10
>
> This is usually an internal programming error that should be reported to
> the developers. In the meantime, a workaround may be to set the MCA param
> routed=direct on the command line or in your environment. We apologize
> for the problem.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:     bos2
>   target node: hpb7
>
> This is usually an internal programming error that should be reported to
> the developers. In the meantime, a workaround may be to set the MCA param
> routed=direct on the command line or in your environment. We apologize
> for the problem.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:     bos2
>   target node: bos15
>
> This is usually an internal programming error that should be reported to
> the developers. In the meantime, a workaround may be to set the MCA param
> routed=direct on the command line or in your environment. We apologize
> for the problem.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:     bos2
>   target node: bos1
>
> This is usually an internal programming error that should be reported to
> the developers. In the meantime, a workaround may be to set the MCA param
> routed=direct on the command line or in your environment. We apologize
> for the problem.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:     bos2
>   target node: hpb11
>
> This is usually an internal programming error that should be reported to
> the developers. In the meantime, a workaround may be to set the MCA param
> routed=direct on the command line or in your environment. We apologize
> for the problem.
> --------------------------------------------------------------------------
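To give a concrete flavour of the Slurm suggestion above: a rough sbatch equivalent of the submission script might look like the sketch below. The partition name is invented and site-specific, and this is only a sketch, not a tested script, but note that no pe_hostfile rewriting is needed, because `--ntasks-per-node=1` asks the scheduler directly for one process per node.

```shell
#!/bin/sh
#SBATCH --job-name=velsyn
#SBATCH --nodes=7                 # one node per MPI process (N = 7)
#SBATCH --ntasks-per-node=1       # replaces the pe_hostfile slot-halving
#SBATCH --cpus-per-task=2         # M = 2 threads per process
#SBATCH --partition=all           # invented partition name - adjust to your site
#SBATCH --output=out.txt

export LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib

# With Open MPI built with Slurm support, mpirun reads the allocation
# from the environment, so no -np or hostfile juggling is required.
mpirun --bind-to none <application with args>
```

The overall shape mirrors the SGE script above: scheduler directives at the top, environment setup, then the mpirun launch.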