Hi,

> On 07.04.2017 at 09:42, Yong Wu <wuy...@gmail.com> wrote:
> 
> Hi all,
> I submit a parallel ORCA (Quantum Chemistry Program) job on multiple nodes
> in Rocks SGE, and get the following error information:
> --------------------------------------------------------------------------
> A hostfile was provided that contains at least one node not
> present in the allocation:
> 
>   hostfile:  test.nodes
>   node:      compute-0-67
> 
> If you are operating in a resource-managed environment, then only
> nodes that are in the allocation can be used in the hostfile. You
> may find relative node syntax to be a useful alternative to
> specifying absolute node names. See the orte_hosts man page for
> further information.
> --------------------------------------------------------------------------
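The relative node syntax the error message refers to would look roughly like this in a hostfile (only a sketch - the +n indices address the nodes of the allocation, and the slot counts are merely an illustration):

  +n0 slots=12    # first node of the allocation, slot count just an example
  +n1 slots=12    # second node of the allocation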
Although a nodefile is not necessary, it might point to a bug in Open MPI - see below for how to get rid of it. Can you please post the output of the $PE_HOSTFILE and the converted test.nodes for a run, and the allocation you got:

qstat -g t

(You can limit the output to your user account and all the lines belonging to the job in question.)

> The ORCA program is compiled with openmpi; here, I used the orte parallel
> environment in Rocks SGE.

Well, you can decide whether I answer here or on the ORCA list ;-)

> $ qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE

This is fine.

> The submitted sge script:
> #!/bin/bash
> # Job submission script:
> # Usage: qsub <this_script>
> #
> #$ -cwd
> #$ -j y
> #$ -o test.sge.o$JOB_ID
> #$ -S /bin/bash
> #$ -N test
> #$ -pe orte 24
> #$ -l h_vmem=3.67G
> #$ -l h_rt=1240:00:00
> 
> # go to work dir
> cd $SGE_O_WORKDIR

There is already a switch for this (#$ -cwd, which you use above), so this cd is redundant.

> 
> # load the module env for ORCA
> source /usr/share/Modules/init/sh
> module load intel/compiler/2011.7.256
> source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh

The "mpivars.sh" does not seem to be part of a default Open MPI installation. Where is it coming from, and what is inside? Did you compile Open MPI with the "--with-sge" option in the ./configure step? In case you didn't compile it on your own, you should see something like this:

$ ompi_info | grep grid
                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.1.0)
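If you did compile it yourself, the relevant part of the ./configure call would look roughly like this (only a sketch - the prefix is taken from the path in your script, and the compiler variables are merely an illustration):

$ ./configure --prefix=/share/apps/mpi/openmpi2.0.2-ifort --with-sge CC=icc CXX=icpc FC=ifort   # compilers only as an example
$ make && make install

The "--with-sge" option builds the gridengine components, so that Open MPI can use SGE's tight integration.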
> export orcapath=/share/apps/orca4.0.0
> export RSH_COMMAND="ssh"
> 
> # create scratch dir on NFS dir
> tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
> mkdir -p $tdir
> 
> #cat $PE_HOSTFILE
> 
> PeHostfile2MachineFile()
> {
>    cat $1 | while read line; do
>       # echo $line
>       host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
>       nslots=`echo $line | cut -f2 -d" "`
>       i=1
>       while [ $i -le $nslots ]; do
>          # add here code to map regular hostnames into ATM hostnames
>          echo $host
>          i=`expr $i + 1`
>       done
>    done
> }
> 
> PeHostfile2MachineFile $PE_HOSTFILE >> $tdir/test.nodes

In former times this conversion was done in the start_proc_args. Nowadays you no longer need this conversion, any "machines" file, or the "test.nodes" file. Open MPI will detect on its own the correct number of slots to use on each node. Only some multi-serial computations in ORCA need rsh/ssh and a nodefile (I have to check whether they don't just pull the information out of `mpiexec`).
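Just to illustrate the tight integration (a sketch with a made-up binary name, not your ORCA call): a plain Open MPI program inside such a job script can be started as

$ mpirun ./my_mpi_program    # no -np and no -hostfile: hosts and slot counts come from SGE

and Open MPI will then launch the processes on the remote nodes via SGE's qrsh.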
> cp ${SGE_O_WORKDIR}/test.inp $tdir
> 
> cd $tdir

Side note: there seem to be several types of jobs in ORCA:

- some types of ORCA jobs can compute happily in $TMPDIR, using the scratch directory on the nodes (even in case the job needs more than one machine)
- some need a shared scratch directory, like the one you create here in the shared /home
- some will start several serial processes on the granted nodes via the defined $RSH_COMMAND

-- Reuti

> 
> echo "ORCA job start at" `date`
> 
> time $orcapath/orca test.inp > ${SGE_O_WORKDIR}/test.log
> 
> rm ${tdir}/test.inp
> rm ${tdir}/test.*tmp 2>/dev/null
> rm ${tdir}/test.*tmp.* 2>/dev/null
> mv ${tdir}/test.* $SGE_O_WORKDIR
> 
> echo "ORCA job finished at" `date`
> 
> echo "Work Dir is : $SGE_O_WORKDIR"
> 
> rm -rf $tdir
> rm $SGE_O_WORKDIR/test.sge
> 
> 
> However, the job can run normally on multiple nodes in Torque.
> 
> Can someone help me? Thanks very much!
> 
> Best regards!
> Yong Wu
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users