Hi,

> Am 07.04.2017 um 09:42 schrieb Yong Wu <wuy...@gmail.com>:
> 
> Hi all,
>   I submit a parallel ORCA (Quantum Chemistry Program) job on multiple nodes 
> in Rocks SGE, and get the following error information:
> --------------------------------------------------------------------------
> A hostfile was provided that contains at least one node not
> present in the allocation:
> 
>   hostfile:  test.nodes
>   node:      compute-0-67
> 
> If you are operating in a resource-managed environment, then only
> nodes that are in the allocation can be used in the hostfile. You
> may find relative node syntax to be a useful alternative to
> specifying absolute node names see the orte_hosts man page for
> further information.
> --------------------------------------------------------------------------

Although a nodefile is not necessary (see below for how to get rid of it), the 
error might point to a bug in Open MPI. Could you please post the content of 
$PE_HOSTFILE, the converted test.nodes of a failing run, and the allocation you 
got:

qstat -g t

(You can limit the output to your user account and all the lines belonging to 
the job in question.)
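
For reference: on most SGE installations $PE_HOSTFILE contains one line per 
granted host in the form "hostname slots queue processor-range", so it should 
look roughly like this (hostnames and slot counts just invented as an example):

compute-0-11.local 16 all.q@compute-0-11.local UNDEFINED
compute-0-67.local 8 all.q@compute-0-67.local UNDEFINED

And `qstat -g t -u <your_account>` should already limit the listing to your own 
jobs.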


> The ORCA program was compiled with Open MPI; here I used the orte parallel 
> environment in Rocks SGE.

Well, you can decide whether I answer here or on the ORCA list ;-)


> $ qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE

This is fine.


> The submitted sge script:
>   #!/bin/bash
>   # Job submission script:
>   # Usage: qsub <this_script>
>   #
>   #$ -cwd
>   #$ -j y
>   #$ -o test.sge.o$JOB_ID
>   #$ -S /bin/bash
>   #$ -N test
>   #$ -pe orte 24
>   #$ -l h_vmem=3.67G
>   #$ -l h_rt=1240:00:00
> 
>   # go to work dir
>   cd $SGE_O_WORKDIR

You already request this with the switch

#$ -cwd

hence the explicit `cd $SGE_O_WORKDIR` isn't necessary.


> 
>   # load the module env for ORCA
>   source /usr/share/Modules/init/sh
>   module load intel/compiler/2011.7.256
>   source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh

The "mpivars.sh" seems not to be in the default Open MPI compilation. Where is 
it coming from, what's inside?

Did you compile Open MPI with "--with-sge" in the ./configure step? Whether you 
compiled it on your own or use a prebuilt one, you can check that the SGE 
integration is present:

$ ompi_info | grep grid
                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.1.0)
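
If the gridengine component doesn't show up, rebuilding Open MPI is the way to 
go. Just as a sketch - the prefix matches the location of your mpivars.sh, and 
the compiler variables are only an example for the Intel toolchain:

./configure --prefix=/share/apps/mpi/openmpi2.0.2-ifort --with-sge CC=icc CXX=icpc FC=ifort
make all install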


>   export orcapath=/share/apps/orca4.0.0
>   export RSH_COMMAND="ssh"
> 
>   #create scratch dir on nfs dir
>   tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
>   mkdir -p $tdir
> 
>   #cat $PE_HOSTFILE
> 
>   PeHostfile2MachineFile()
>   {
>      cat $1 | while read line; do
>         # echo $line
>         host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>         nslots=`echo $line|cut -f2 -d" "`
>         i=1
>         while [ $i -le $nslots ]; do
>            # add here code to map regular hostnames into ATM hostnames
>            echo $host
>            i=`expr $i + 1`
>         done
>      done
>   }
> 
>   PeHostfile2MachineFile $PE_HOSTFILE >> $tdir/test.nodes

In former times this conversion was done in start_proc_args. Nowadays you need 
neither the conversion nor any "machines" or "test.nodes" file: Open MPI (built 
with SGE support) will detect on its own the correct number of slots to use on 
each node.

Only some multi-serial computations in ORCA need rsh/ssh and a nodefile (I would 
have to check whether they don't simply pull this information out of an 
`mpiexec` call).
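
So with a working SGE integration the middle part of your script can shrink to 
something like the following sketch (keeping your shared scratch directory, but 
without any nodefile):

tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
mkdir -p $tdir
cp test.inp $tdir
cd $tdir
$orcapath/orca test.inp > $SGE_O_WORKDIR/test.log

ORCA's internal mpirun calls should then get the granted nodes and slot counts 
directly from Open MPI's gridengine support.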


>   cp ${SGE_O_WORKDIR}/test.inp $tdir
> 
>   cd $tdir

Side note:

There seem to be several types of jobs in ORCA:

- some can compute happily in $TMPDIR, i.e. the local scratch directory on the 
nodes (even in case the job needs more than one machine) - see the sketch after 
this list
- some need a shared scratch directory, like the one you create here in the 
shared /home
- some will start several serial processes on the granted nodes via the defined 
$RSH_COMMAND
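
For the first case, a small sketch of what the scratch handling could look like 
with the node-local $TMPDIR which SGE creates and removes for you (assuming the 
files fit on the local disk; paths as in your script):

cp $SGE_O_WORKDIR/test.inp $TMPDIR
cd $TMPDIR
$orcapath/orca test.inp > $SGE_O_WORKDIR/test.log
rm $TMPDIR/test.inp
mv $TMPDIR/test.* $SGE_O_WORKDIR   # SGE removes $TMPDIR itself at job end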

-- Reuti


> 
>   echo "ORCA job start at" `date`
> 
>   time $orcapath/orca test.inp > ${SGE_O_WORKDIR}/test.log
> 
>   rm ${tdir}/test.inp
>   rm ${tdir}/test.*tmp 2>/dev/null
>   rm ${tdir}/test.*tmp.* 2>/dev/null
>   mv ${tdir}/test.* $SGE_O_WORKDIR
> 
>   echo "ORCA job finished at" `date`
> 
>   echo "Work Dir is : $SGE_O_WORKDIR"
> 
>   rm -rf $tdir
>   rm $SGE_O_WORKDIR/test.sge
> 
> 
> However, the job can run normally on multiple nodes in Torque.
> 
> Can someone help me? Thanks very much!
> 
> Best regards!
> Yong Wu
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
