The error handler wouldn't be called in that situation - we simply abort the
job. We expect to provide that integration in something like the 1.7.4 release
milestone.
On Aug 10, 2013, at 11:07 AM, Edson Tavares de Camargo wrote:
> Hi All,
>
> I was looking for posts about fault tolerance in
Hi Gus,
I think your suggestion sounds good. I'll leave the PBS_NODEFILE intact.
Thank you again for your assistance!
- Lee-Ping
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 5:36 PM
To: Open MPI Users
Subj
Hi Lee-Ping
Yes, configuring --without-tm, as Ralph told you to do,
will make your OpenMPI independent of Torque, although as Ralph said,
even with an Open MPI configured with Torque support you can override it at
runtime.
I don't know what Open MPI uses the PBS_JOBID for, maybe some internal
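The --without-tm rebuild mentioned above can be sketched as a minimal recipe; the tarball directory and install prefix here are assumptions, not anything from the thread:

```shell
# Hedged sketch: rebuild Open MPI without Torque (tm) support.
# The version directory and install prefix are assumptions.
if [ -d openmpi-1.4.2 ]; then      # only if the unpacked tarball is present
    cd openmpi-1.4.2
    ./configure --without-tm --prefix="$HOME/openmpi-no-tm"
    make -j4 && make install
fi
```

With tm support compiled out, mpirun no longer consults the Torque allocation and falls back to the -machinefile / hostfile you give it, even inside a Torque job.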
Hi Ralph,
Thank you. I didn't know that "--without-tm" was the correct configure
option. I built and reinstalled OpenMPI 1.4.2, and now I no longer need to
set PBS_JOBID for it to recognize the correct machine file. My current
workflow is:
1) Submit a multiple-node batch job.
2) Launch a sepa
It helps if you use the correct configure option: --without-tm
Regardless, you can always deselect Torque support at runtime. Just put the
following in your environment:
OMPI_MCA_ras=^tm
That will tell ORTE to ignore the Torque allocation module and it should then
look at the machinefile.
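The runtime override is an ordinary MCA parameter, so it can go in the environment or on the command line; the machinefile name and program below are placeholders:

```shell
# Tell ORTE's resource allocation subsystem (ras) to skip the Torque
# module for every subsequent mpirun launched from this shell:
export OMPI_MCA_ras=^tm

# Launches then honor an explicit machinefile (names are placeholders):
#   mpirun -machinefile my_hosts -np 8 ./a.out
# The same override can also be given per-invocation:
#   mpirun --mca ras ^tm -machinefile my_hosts -np 8 ./a.out
```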
On
Hi Gus,
I agree that $PBS_JOBID should not point to a file in normal situations,
because it is the job identifier given by the scheduler. However,
ras_tm_module.c actually does search for a file named $PBS_JOBID, and that
seems to be why it was failing. You can see this in the source code as well.
Lee-Ping
Something looks amiss.
PBS_JOBID contains the job name.
PBS_NODEFILE contains the list of nodes that Torque assigned to the job,
with each node repeated once per allocated core.
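For concreteness, a PBS_NODEFILE for a hypothetical 2-node job with 4 cores per node would look like the file built below (the node names are made up):

```shell
# Illustrative only: Torque writes one line per allocated core,
# so each node name repeats once per core on that node.
cat > pbs_nodefile.example <<'EOF'
node01
node01
node01
node01
node02
node02
node02
node02
EOF

# The distinct node names can be recovered with sort -u:
sort -u pbs_nodefile.example
```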
Why things get twisted is hard to tell; it may be something in the Q-Chem
scripts
(could it be mixing up P
Hi Lee-Ping
Is /scratch/leeping/272055.certainty.stanford.edu the actual PBS_NODEFILE
provided by Torque?
You could check this by adding a few lines to the Q-Chem launching scripts:
ls -l $PBS_NODEFILE
cat $PBS_NODEFILE
This can go right before the mpiexec line.
If the OpenMPI with torque suppor
Hi Gus,
It seems the calculation is now working, or at least it didn't crash. I set
the PBS_JOBID environment variable to the name of my custom node file. That
is to say, I set PBS_JOBID=pbs_nodefile.compute-3-3.local. It appears that
ras_tm_module.c is trying to open the file located at
/scrat
Hi Gus,
I tried your suggestions. Here is the command line which executes mpirun.
I was puzzled because it still reported a file open failure, so I inserted a
print statement into ras_tm_module.c and recompiled. The results are below.
As you can see, it tries to open a different file
(/scratch/l
Hi Gus,
Thank you. You gave me many helpful suggestions, which I will try out and
get back to you. I will provide more specifics (e.g. how my jobs were
submitted) in a future email.
As for the queue policy, that is a highly political issue because the
cluster is a shared resource. My usual r
Hi Lee-Ping
On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote:
> Hi Gus,
>
> Thank you for your reply. I want to run MPI jobs inside a single node, but
> due to the resource allocation policies on the clusters, I could get many
> more resources if I submit multiple-node "batch jobs". Once I have
... from a (probably obsolete) Q-Chem user guide found on the Web:
***
" To run parallel Q-Chem
using a batch scheduler such as PBS, users may have to modify the
mpirun command in $QC/bin/parallel.csh
depending on whether or not the MPI implementation requires the
-machinefile option to be given
Hi Gus,
Thank you for your reply. I want to run MPI jobs inside a single node, but
due to the resource allocation policies on the clusters, I could get many
more resources if I submit multiple-node "batch jobs". Once I have a
multiple-node batch job, then I can use a command like "pbsdsh" to run
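The setup described above (one multi-node batch job driving independent single-node tasks) can be sketched by splitting $PBS_NODEFILE into per-node machinefiles; the file names and the commented launch line are assumptions, not the thread's actual scripts:

```shell
# Hedged sketch: give each single-node task a machinefile containing
# only its own node's slots. Falls back to demo data outside Torque.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=pbs_nodefile.demo
    printf 'node01\nnode01\nnode02\nnode02\n' > "$PBS_NODEFILE"
fi

for node in $(sort -u "$PBS_NODEFILE"); do
    # keep only this node's lines (one per allocated core)
    grep -x "$node" "$PBS_NODEFILE" > "machinefile.$node"
    # each per-node task could then be launched with something like:
    #   pbsdsh -h "$node" mpirun -machinefile "machinefile.$node" \
    #       -np "$(wc -l < machinefile.$node)" ./task
done
```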
Hi Lee-Ping
I know nothing about Q-Chem, but I was confused by these sentences:
"That is to say, these tasks are intended to use OpenMPI parallelism on each
node, but no parallelism across nodes. "
"I do not observe this error when submitting single-node jobs."
"Since my jobs are only parallel
Hi All,
I was looking for posts about fault tolerance in MPI and I found the post
below:
http://www.open-mpi.org/community/lists/users/2012/06/19658.php
I am trying to understand all the work on failure detection present in
Open MPI. So, I began with a simple application, a ring application
(rin
Hi there,
Recently, I've begun some calculations on a cluster where I submit a
multiple node job to the Torque batch system, and the job executes multiple
single-node parallel tasks. That is to say, these tasks are intended to use
OpenMPI parallelism on each node, but no parallelism across nodes.