We use Torque with OMPI here on almost every cluster, running 64-bit
jobs with the Intel compilers, so I doubt the problem is with Torque.
It is probably an issue with library paths.
Torque doesn't automatically forward your environment, nor does it
execute your remote .bashrc (or equivalent) when starting your remote
process. While ssh also typically doesn't forward the environment
(though your sys admin may have set it up to do so), it does execute
the remote .bashrc, which could be setting the correct path. I should
also note that mpirun will automatically forward LD_LIBRARY_PATH and
PATH for you, which differs from what we do for the other launchers.
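If you want to see the difference for yourself, a quick check along these
lines can help (the node name and program name are just taken from your
example below; whether the ssh line picks up your Intel paths depends on
how your .bashrc is written):

  # From the submit host: what does a fresh ssh shell on a compute node see?
  ssh n89 'echo $LD_LIBRARY_PATH'

  # From inside your interactive qsub shell: what will Torque-launched
  # processes inherit?
  echo $LD_LIBRARY_PATH
  ldd ./MPI_li_64 | grep "not found"   # any libraries that don't resolve here?

If the Intel runtime directories show up in the first case but not the
second, that could explain a segfault that only appears under Torque.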
If you execute your MPI_li_64 program locally on each of your nodes
(i.e., both processes run locally), does it work? If so, then try adding
-x LD_LIBRARY_PATH
to your mpirun command line. This tells mpirun to pick up your local
library path and forward it for you regardless of the launch environment.
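For example (just a sketch, using the program name from your message; -x
can be given more than once if you also need PATH or other variables
forwarded):

  mpirun -np 2 -x LD_LIBRARY_PATH ./MPI_li_64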
On Aug 11, 2009, at 10:17 PM, Sims, James S. Dr. wrote:
Back to this problem.
The last suggestion was to upgrade to 1.3.3, which has been done. I still
cannot get this code to run in 64-bit mode with torque. What I can do is
run the job in 64-bit mode using a hostfile.
Specifically, if I use
qsub -I -l nodes=2:ppn=1
torque allocates two nodes to the job and, since this is an interactive
shell, logs me in to the controlling node. In this example, process
rank 0 is on n72 and process rank 1 is on n89:
[sims@n72 4000]$ mpirun --display-allocation -pernode --display-map hostname

====================== ALLOCATED NODES ======================
 Data for node: Name: n72.clust.nist.gov  Num slots: 1  Max slots: 0
 Data for node: Name: n89                 Num slots: 1  Max slots: 0
=================================================================

======================== JOB MAP ========================
 Data for node: Name: n72.clust.nist.gov  Num procs: 1
        Process OMPI jobid: [47657,1]  Process rank: 0
 Data for node: Name: n89  Num procs: 1
        Process OMPI jobid: [47657,1]  Process rank: 1
=============================================================

n89
n72.clust.nist.gov
My hostfile is
[sims@n72 4000]$ cat hostfile
n72
n89
If, logged in to n72, I use the command
mpirun -np 2 ./MPI_li_64
the job fails with:
mpirun noticed that process rank 1 with PID 10538 on node n89 exited on signal 11 (Segmentation fault).
If I use the command
mpirun -np 2 --hostfile hostfile ./MPI_li_64
the same thing happens.
However, if I ssh to n73, for example, and use the command
mpirun -np 2 --hostfile hostfile ./MPI_li_64
everything works fine. So it appears that the problem is with torque.
Any ideas?