Thanks for your reply. First of all, I can run this job on multiple nodes without any resource manager (no Torque/SGE), and it also runs fine under Torque. But the job does not work on multiple nodes with gridengine. I suspect this is caused by the gridengine parallel environment; however, I get the same error with the orte, mpi, and mpich PEs.
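For reference, the PE settings that matter for a tight integration can be compared with a quick check on the submit host (just a sketch, assuming the three PE names mentioned above):

for pe in orte mpi mpich; do
   echo "== $pe =="
   # show only the fields relevant for tight integration
   qconf -sp $pe | egrep 'allocation_rule|control_slaves|job_is_first_task'
done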
I answer your questions below.

> Can you please post the output of the $PE_HOSTFILE and the converted
> test.nodes for a run, and the allocation you got:
>
> qstat -g t

The output of $PE_HOSTFILE:

compute-0-34.local 16 bgmnode.q@compute-0-34.local UNDEFINED
compute-0-67.local 8 bgmnode.q@compute-0-67.local UNDEFINED

The converted test.nodes:

$ cat test.nodes
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-34
compute-0-67
compute-0-67
compute-0-67
compute-0-67
compute-0-67
compute-0-67
compute-0-67
compute-0-67

The allocation:

$ qstat -g t
job-ID  prior    name  user  state  submit/start at      queue                         master  ja-task-ID
-----------------------------------------------------------------------------------------------------------
 84462  0.60500  test  wuy   r      04/07/2017 21:37:18  bgmnode.q@compute-0-34.local  MASTER
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
                                                         bgmnode.q@compute-0-34.local  SLAVE
 84462  0.60500  test  wuy   r      04/07/2017 21:37:18  bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE
                                                         bgmnode.q@compute-0-67.local  SLAVE

> The "mpivars.sh" seems not to be in the default Open MPI compilation.
> Where is it coming from, what's inside?

I created the "mpivars.sh" myself; its content is:

$ cat /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
# PATH
if test -z "`echo $PATH | grep /share/apps/mpi/openmpi2.0.2-ifort/bin`"; then
    PATH=/share/apps/mpi/openmpi2.0.2-ifort/bin:${PATH}
    export PATH
fi

# LD_LIBRARY_PATH
if test -z "`echo $LD_LIBRARY_PATH | grep /share/apps/mpi/openmpi2.0.2-ifort/lib`"; then
    LD_LIBRARY_PATH=/share/apps/mpi/openmpi2.0.2-ifort/lib${LD_LIBRARY_PATH:+:}${LD_LIBRARY_PATH}
    export LD_LIBRARY_PATH
fi

# MANPATH
if test -z "`echo $MANPATH | grep /share/apps/mpi/openmpi2.0.2-ifort/share/man`"; then
    MANPATH=/share/apps/mpi/openmpi2.0.2-ifort/share/man:${MANPATH}
    export MANPATH
fi

> Did you compile Open MPI with the "--with-sge" in the ./configure step?

Yes, I configured it with the Intel compiler and with the "--with-sge" option:

$ module load intel/compiler/2011.7.256
$ source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)

> Side note:

Regarding your side note: I create the same directory on each node, and I also use an NFS-shared directory as the scratch directory.
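One thing I notice when comparing the outputs above: $PE_HOSTFILE and qstat report the nodes as compute-0-XX.local, while the converted test.nodes contains only the short names, because the conversion in my job script (quoted below) cuts the hostname at the first ".". I am not sure whether this matters, but a variant that keeps the hostname exactly as it appears in $PE_HOSTFILE would look like this (untested sketch):

PeHostfile2MachineFile()
{
   # keep the full hostname from $PE_HOSTFILE (e.g. compute-0-34.local)
   while read host nslots queue rest; do
      i=1
      while [ $i -le $nslots ]; do
         echo "$host"
         i=`expr $i + 1`
      done
   done < "$1"
}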
And I use the following environment:

source /usr/share/Modules/init/sh
module load intel/compiler/2011.7.256
source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
export RSH_COMMAND="ssh"

With this environment I can run the ORCA job normally on multiple nodes without gridengine by typing:

/share/apps/orca4.0.0/orca test.inp &>test.log &

But under the gridengine resource manager I get the error:

--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:

  hostfile:  test.nodes
  node:      compute-0-67

If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------

I do not know why.

Best regards,
Yong Wu


2017-04-07 20:14 GMT+08:00 Reuti <re...@staff.uni-marburg.de>:
> Hi,
>
> > Am 07.04.2017 um 09:42 schrieb Yong Wu <wuy...@gmail.com>:
> >
> > Hi all,
> > I submit a parallel ORCA (Quantum Chemistry Program) job on multiple
> > nodes in Rocks SGE, and get the following error information,
> > --------------------------------------------------------------------------
> > A hostfile was provided that contains at least one node not
> > present in the allocation:
> >
> >   hostfile:  test.nodes
> >   node:      compute-0-67
> >
> > If you are operating in a resource-managed environment, then only
> > nodes that are in the allocation can be used in the hostfile. You
> > may find relative node syntax to be a useful alternative to
> > specifying absolute node names see the orte_hosts man page for
> > further information.
> > --------------------------------------------------------------------------
>
> Although a nodefile is not necessary, it might point to a bug in Open MPI
> - see below to get rid of it. Can you please post the output of the
> $PE_HOSTFILE and the converted test.nodes for a run, and the allocation
> you got:
>
> qstat -g t
>
> (You can limit the output to your user account and all the lines
> belonging to the job in question.)
>
> > The ORCA program is compiled with openmpi; here, I used the orte
> > parallel environment in Rocks SGE.
>
> Well, you can decide whether I answer here or on the ORCA list ;-)
>
> > $ qconf -sp orte
> > pe_name            orte
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $fill_up
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary TRUE
>
> This is fine.
>
> > The submitted sge script:
> > #!/bin/bash
> > # Job submission script:
> > # Usage: qsub <this_script>
> > #
> > #$ -cwd
> > #$ -j y
> > #$ -o test.sge.o$JOB_ID
> > #$ -S /bin/bash
> > #$ -N test
> > #$ -pe orte 24
> > #$ -l h_vmem=3.67G
> > #$ -l h_rt=1240:00:00
> >
> > # go to work dir
> > cd $SGE_O_WORKDIR
>
> There is a switch for it:
>
> #$ -cwd
>
> > # load the module env for ORCA
> > source /usr/share/Modules/init/sh
> > module load intel/compiler/2011.7.256
> > source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
>
> The "mpivars.sh" seems not to be in the default Open MPI compilation.
> Where is it coming from, what's inside?
>
> Did you compile Open MPI with the "--with-sge" in the ./configure step?
> In case you didn't compile it on your own, you should see something like
> this:
>
> $ ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.1.0)
>
> > export orcapath=/share/apps/orca4.0.0
> > export RSH_COMMAND="ssh"
> >
> > # create scratch dir on nfs dir
> > tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
> > mkdir -p $tdir
> >
> > #cat $PE_HOSTFILE
> >
> > PeHostfile2MachineFile()
> > {
> >    cat $1 | while read line; do
> >       # echo $line
> >       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
> >       nslots=`echo $line|cut -f2 -d" "`
> >       i=1
> >       while [ $i -le $nslots ]; do
> >          # add here code to map regular hostnames into ATM hostnames
> >          echo $host
> >          i=`expr $i + 1`
> >       done
> >    done
> > }
> >
> > PeHostfile2MachineFile $PE_HOSTFILE >> $tdir/test.nodes
>
> In former times, this conversion was done in the start_proc_args.
> Nowadays you neither need this conversion, nor any "machines" file, nor
> the "test.nodes" file any longer. Open MPI will detect on its own the
> correct number of slots to use on each node.
>
> There are only some multi-serial computations in ORCA, which need
> rsh/ssh and a nodefile (I have to check whether they don't just pull the
> information out of a `mpiexec`).
>
> > cp ${SGE_O_WORKDIR}/test.inp $tdir
> >
> > cd $tdir
>
> Side note:
>
> In ORCA there seem to be several types of jobs:
>
> - some types of ORCA jobs can compute happily in $TMPDIR using the
>   scratch directory on the nodes (even in case the job needs more than
>   one machine)
> - some need a shared scratch directory, like you create here in the
>   shared /home
> - some will start several serial processes on the granted nodes by the
>   defined $RSH_COMMAND
>
> -- Reuti
>
> > echo "ORCA job start at" `date`
> >
> > time $orcapath/orca test.inp > ${SGE_O_WORKDIR}/test.log
> >
> > rm ${tdir}/test.inp
> > rm ${tdir}/test.*tmp 2>/dev/null
> > rm ${tdir}/test.*tmp.* 2>/dev/null
> > mv ${tdir}/test.* $SGE_O_WORKDIR
> >
> > echo "ORCA job finished at" `date`
> >
> > echo "Work Dir is : $SGE_O_WORKDIR"
> >
> > rm -rf $tdir
> > rm $SGE_O_WORKDIR/test.sge
> >
> > However, the job can run normally on multiple nodes in Torque.
> >
> > Can someone help me? Thanks very much!
> >
> > Best regards!
> > Yong Wu
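P.S. Following your remark that Open MPI detects the correct number of slots on its own, I will first try a bare test job without creating any test.nodes file. This is only a sketch I have not run yet, and it checks the tight integration itself, not ORCA:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -N mpitest
#$ -pe orte 24

# same environment as the real job
source /usr/share/Modules/init/sh
module load intel/compiler/2011.7.256
source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh

# no hostfile and no -np: with "--with-sge", mpirun should take the host
# list and the slot counts directly from the gridengine allocation
mpirun hostname

If this prints one hostname line per granted slot, the tight integration itself is working.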