... from a (probably obsolete) Q-Chem user guide found on the Web:

***
"To run parallel Q-Chem using a batch scheduler such as PBS, users may have to modify the mpirun command in $QC/bin/parallel.csh depending on whether or not the MPI implementation requires the -machinefile option to be given. For further details users should read $QC/README.Parallel, and contact Q-Chem if any problems are encountered (email: support@q-chem.com). Parallel users should also read the above section on using serial Q-Chem. Users can also run Q-Chem's coupled-cluster calculations in parallel on multi-core architectures. Please see Section 5.12 for details."
***

Guesses:

1) Q-Chem is launched by a set of scripts provided by Q-chem.com (or something along those lines), and the mpiexec command line is buried somewhere in those scripts, not directly visible to the user. Right?

2) Look for the -machinefile switch in their script $QC/bin/parallel.csh and replace it with -hostfile $PBS_NODEFILE, roughly as sketched below.
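For illustration, the kind of edit guess 2 describes might look roughly like this. I don't have the script in front of me, so the mpirun line shown is only an assumption about what a typical parallel.csh contains, not the real thing, and $NPROC, $QCPROG and $QCIN are placeholder names:

# Locate where Q-Chem's wrapper script builds its launch command:
grep -n machinefile $QC/bin/parallel.csh

# If it contains something along the lines of (hypothetical):
#   mpirun -machinefile $QC/bin/machines -np $NPROC $QCPROG $QCIN
# change the node-list option so it uses the file Torque hands out:
#   mpirun -hostfile $PBS_NODEFILE -np $NPROC $QCPROG $QCIN

Keep a backup of the original script before touching it.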
My two cents,
Gus Correa

On Aug 10, 2013, at 3:03 PM, Gustavo Correa wrote:

> Hi Lee-Ping
>
> I know nothing about Q-Chem, but I was confused by these sentences:
>
> "That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes."
>
> "I do not observe this error when submitting single-node jobs."
>
> "Since my jobs are only parallel over the node they're running on, I believe that a node file of any kind is unnecessary."
>
> Are you trying to run MPI jobs across several nodes or inside a single node?
>
> ***
>
> Anyway, as far as I know, if your OpenMPI was compiled with Torque/PBS support, the mpiexec/mpirun command will look for the $PBS_NODEFILE to learn in which node(s) it should launch the MPI processes, regardless of whether you are using one node or more than one node.
>
> You didn't send your mpiexec command line (which would help), but assuming that Q-Chem allows some level of standard mpiexec command options, you could force passing the $PBS_NODEFILE to it.
>
> Something like this (for two nodes with 8 cores each):
>
> #PBS -q myqueue
> #PBS -l nodes=2:ppn=8
> #PBS -N myjob
> cd $PBS_O_WORKDIR
> ls -l $PBS_NODEFILE
> cat $PBS_NODEFILE
>
> mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable <parameters to Q-chem>
>
> I hope this helps,
> Gus Correa
>
> On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:
>
>> Hi there,
>>
>> Recently, I've begun some calculations on a cluster where I submit a multiple-node job to the Torque batch system, and the job executes multiple single-node parallel tasks. That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes.
>>
>> Some background: the actual program being executed is Q-Chem 4.0. I use OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile and this is the last version of OpenMPI that this version of Q-Chem is known to work with.
>>
>> My jobs are failing with the error message below; I do not observe this error when submitting single-node jobs. From reading the mailing list archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php), I believe it is looking for a PBS node file somewhere. Since my jobs are only parallel over the node they're running on, I believe that a node file of any kind is unnecessary.
>>
>> My question is: why is OpenMPI behaving differently when I submit a multi-node job compared to a single-node job? How does OpenMPI detect that it is running under a multi-node allocation? Is there a way I can change OpenMPI's behavior so it always thinks it's running on a single node, regardless of the type of job I submit to the batch system?
>>
>> Thank you,
>>
>> - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
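One more thought on Lee-Ping's last question (making OpenMPI behave as if each task were a plain single-node run): an untested possibility, assuming the Q-Chem wrapper picks up the normal environment, is to exclude Open MPI's Torque components via MCA parameters set in the job script before Q-Chem is invoked, which avoids editing parallel.csh at all:

# Untested sketch: tell Open MPI not to use its Torque (tm) components,
# the ones named in the ras_tm_module.c / plm_tm_module.c errors above,
# so each mpirun falls back to a local, single-node launch.
export OMPI_MCA_ras=^tm
export OMPI_MCA_plm=^tm
# (bash syntax; in csh use: setenv OMPI_MCA_ras '^tm' etc.)

Whether that actually clears those errors would need a quick test; passing the $PBS_NODEFILE through as suggested earlier is the more conventional fix.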