Hi,

Does anyone have any idea about the problem below?

I have verified that the TMPDIR path is writable.

If the information below is not sufficient, what else can I provide? If there 
are additional debug flags that would generate more information, I can try 
those as well.

I would appreciate any help I can get on this.

Thanks,
Vipul


-----Original Message-----
From: Kulshrestha, Vipul 
Sent: Thursday, March 19, 2020 5:31 PM
To: 'Reuti' <re...@staff.uni-marburg.de>; Open MPI Users 
<users@lists.open-mpi.org>
Subject: RE: [OMPI users] running mpirun with grid

Hi Reuti,

I was finally able to figure out how to set this up, and I have had partial 
success. Grid is able to start the job, and mpirun also starts to run, but it 
then fails with the errors below. Strangely, after reporting that all the 
daemons have started, it reports that it was unable to start one or more 
daemons.

I have set up a grid submission script that modifies the pe_hostfile; mpirun 
appears to pick it up and use the host information to start launching the 
jobs. However, mpirun halts before it can start all the child processes. I 
enabled some debug logs but cannot identify a possible cause.

Could you, or somebody else, look at this and advise how to resolve the issue?

I have pasted the detailed log as well as my job submission script below.

As a clarification, when I run mpirun without grid, it (mpirun and my 
application) works on the same set of hosts without any problems.

Thanks,
Vipul

Job submission script:
#!/bin/sh
#$ -N velsyn
#$ -pe orte2 14
#$ -V -cwd -j y
#$ -o out.txt
#
echo "Got $NSLOTS slots."
echo "tmpdir is $TMPDIR"
echo "pe_hostfile is $PE_HOSTFILE"


cat $PE_HOSTFILE
newhostfile=/testdir/tmp/pe_hostfile

# halve each host's slot count (2nd field of the pe_hostfile): 2 -> 1
awk '{$2 = $2/2; print}' "$PE_HOSTFILE" > "$newhostfile"

export PE_HOSTFILE=$newhostfile
export LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib

mpirun --merge-stderr-to-stdout --output-filename ./output:nojobid,nocopy \
    --mca routed direct --mca orte_base_help_aggregate 0 --mca plm_base_verbose 1 \
    --bind-to none --report-bindings -np 7 <application with args>
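One change I am considering, following your $TMPDIR suggestion: writing the 
mangled copy into the job-private $TMPDIR instead of the shared /testdir/tmp 
path, so that concurrent jobs cannot overwrite each other's hostfile. A sketch 
of that variant:

newhostfile=$TMPDIR/pe_hostfile
awk '{$2 = $2/2; print}' "$PE_HOSTFILE" > "$newhostfile"
export PE_HOSTFILE=$newhostfile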

The out.txt content is:
Got 14 slots.
tmpdir is /tmp/182117160.1.all.q
pe_hostfile is /var/spool/sge/bos2/active_jobs/182117160.1/pe_hostfile
bos2.wv.org.com 2 al...@bos2.wv.org.com <NULL>
art8.wv.org.com 2 al...@art8.wv.org.com <NULL>
art10.wv.org.com 2 al...@art10.wv.org.com <NULL>
hpb7.wv.org.com 2 al...@hpb7.wv.org.com <NULL>
bos15.wv.org.com 2 al...@bos15.wv.org.com <NULL>
bos1.wv.org.com 2 al...@bos1.wv.org.com <NULL>
hpb11.wv.org.com 2 al...@hpb11.wv.org.com <NULL>
[bos2:22657] [[8251,0],0] plm:rsh: using "/wv/grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[bos2:22657] [[8251,0],0] plm:rsh: final template argv:
  /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca orte_report_bindings "1" -mca ess "env" -mca ess_base_jobid "540737536" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "7" -mca orte_node_regex "bos[1:2],art[1:8],art[2:10],hpb[1:7],bos[2:15],bos[1:1],hpb[2:11]@0(7)" -mca orte_hnp_uri "540737536.0;tcp://147.34.116.60:50769" --mca routed "direct" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "1" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "540737536.0;tcp://147.34.116.60:50769" -mca orte_output_filename "./output:nojobid,nocopy" -mca hwloc_base_binding_policy "none" -mca hwloc_base_report_bindings "1" -mca pmix "^s1,s2,cray,isolated"
Starting server daemon at host "art10"
Starting server daemon at host "art8"
Starting server daemon at host "bos1"
Starting server daemon at host "hpb7"
Starting server daemon at host "hpb11"
Starting server daemon at host "bos15"
Server daemon successfully started with task id "1.art8"
Server daemon successfully started with task id "1.bos1"
Server daemon successfully started with task id "1.art10"
Server daemon successfully started with task id "1.bos15"
Server daemon successfully started with task id "1.hpb7"
Server daemon successfully started with task id "1.hpb11"
Unmatched ".
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   bos2
  target node:  art10

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   bos2
  target node:  hpb7

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   bos2
  target node:  bos15

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   bos2
  target node:  bos1

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   bos2
  target node:  hpb11

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------







-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de] 
Sent: Thursday, February 6, 2020 4:35 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
Subject: Re: [OMPI users] running mpirun with grid

Hi,

> On 06.02.2020 at 21:47, Kulshrestha, Vipul via users 
> <users@lists.open-mpi.org> wrote:
> 
> Hi,
>  
> I need to launch my Open MPI application on the grid.
>  
> My application is designed to run N processes, where each process would have 
> M threads.
>  
> To run it without grid (say N = 7, M = 2), I run:
> % mpirun -np 7 <application name with arguments>
>  
> The above works well and runs N processes. I am also able to submit it to 
> the grid using the command below, and it works.
>  
> % qsub -pe orte 7 -l os-redhat6.7* -V -j y -b y -shell no mpirun -np 7 
> <application name with arguments>
>  
> However, the above job allocates only N slots on the grid, when it really 
> consumes N*M slots. How do I submit the qsub command so that it reserves 
> N*M slots while starting N processes? I tried the below, but I get a weird 
> error from ORTE, as pasted below.
>  
> % qsub -pe orte 14 -l os-redhat6.7* -V -j y -b y -shell no mpirun -np 7 
> <application name with arguments>

a) You will first have to ask the admin to provide a fixed allocation rule on 
all involved nodes, e.g. "allocation_rule 2", and to name this PE "orte2". 
This way you can be sure to always get exactly 2 slots on each node.
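For reference, a minimal sketch of such a PE definition (only the fields 
relevant here, in `qconf -sp orte2` format; the slot total is site-specific):

pe_name            orte2
slots              9999
allocation_rule    2
control_slaves     TRUE
job_is_first_task  FALSE

"control_slaves TRUE" is what lets Open MPI start its daemons under SGE's 
control via `qrsh -inherit`.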

b) Instead of submitting a binary, you will need a job script in which you 
mangle the provided PE_HOSTFILE so that each node is listed with a slot count 
of 1, i.e. Open MPI should believe it is to start only one process per node. 
You can then use the remaining core for an additional thread. As the original 
file can't be changed, it has to be copied and adjusted, and PE_HOSTFILE reset 
to point to the new file.
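A minimal sketch of this mangling in the job script (assuming the job-private 
$TMPDIR is used for the copy):

# copy the PE_HOSTFILE with the slot count (2nd field) forced to 1,
# so that Open MPI starts only one process per node
mangled=$TMPDIR/pe_hostfile
awk '{$2 = 1; print}' "$PE_HOSTFILE" > "$mangled"
export PE_HOSTFILE=$mangled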

c) It would be nice if the admin could prepare the mangled PE_HOSTFILE in 
advance (maybe by dividing the slot count by the last digit of the PE name) in 
a parallel prolog and put it in the $TMPDIR of the job. As environment 
variables won't be inherited by the job, in this case too you will have to 
point the PE_HOSTFILE environment variable to the mangled file in your job 
script.
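A sketch of such a prolog (hypothetical; it assumes $PE and $PE_HOSTFILE are 
present in the prolog's environment, which may be site-dependent):

#!/bin/sh
# divide each host's slot count by the trailing digit of the PE name,
# e.g. PE "orte2" -> divisor 2
div=`echo "$PE" | sed 's/.*[^0-9]//'`
awk -v d="$div" '{$2 = $2 / d; print}' "$PE_HOSTFILE" > "$TMPDIR/pe_hostfile"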

d) SGE should be given the real number of slots your job needs at submission 
time, i.e. 14 in your case.

This way you will get an allocation of 14 slots; due to the fixed allocation 
rule of "orte2" they are equally distributed. The mangled PE_HOSTFILE will 
list only one slot per node, so Open MPI will start only one process per node, 
for a total of 7. You can then use OMP_NUM_THREADS=2 or the like to tell your 
application to start an additional thread per node. The OMP_NUM_THREADS 
environment variable should also be distributed to the nodes via the "-x" 
option to `mpirun` (or use MPI itself to distribute this information).
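For example (a sketch; the application placeholder is yours):

% mpirun -np 7 -x OMP_NUM_THREADS=2 <application with args>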

Note that, in contrast to Torque, you are guaranteed to get each node only 
once. AFAIR there was a setting in Torque to allow or disallow multiple 
selections of the fixed allocation rule per node.

HTH -- Reuti
