We use Torque with OMPI here on almost every cluster, running 64-bit jobs with the Intel compilers, so I doubt the problem is with Torque. It is probably an issue with library paths.

Torque doesn't automatically forward your environment, nor does it execute your remote .bashrc (or equivalent) when starting your remote process. While ssh also typically doesn't forward the environment (though your sysadmin may have set it up to do so), it does execute the remote .bashrc, which could be setting the correct path. I should also note that mpirun does not automatically forward LD_LIBRARY_PATH and PATH for you; you have to ask it to, as described below.
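
A quick way to see the difference is to compare what the remote process actually receives in each case. As a sketch (the node name n89 is taken from your output below, and printenv is just a convenient stand-in for your real program):

ssh n89 printenv LD_LIBRARY_PATH
mpirun -host n89 -np 1 printenv LD_LIBRARY_PATH

Run the second command from inside your Torque job so the Torque launcher is used. If the ssh version shows your library path and the mpirun version comes back empty or different, then the processes Torque starts for you aren't getting the path your 64-bit executable needs.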

If you execute your MPI_li_64 program locally on each of your nodes (i.e., both processes running on the local node), does it work? If so, then try adding

-x LD_LIBRARY_PATH

to your mpirun command line. This will tell mpirun to pick up your local LD_LIBRARY_PATH and forward it for you, regardless of the launch environment.
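
For example (a sketch using the executable name and process count from your report):

mpirun -np 2 -x LD_LIBRARY_PATH ./MPI_li_64

The -x option can be repeated if you also need PATH or other variables forwarded, e.g. -x PATH -x LD_LIBRARY_PATH.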


On Aug 11, 2009, at 10:17 PM, Sims, James S. Dr. wrote:

Back to this problem.

The last suggestion was to upgrade to 1.3.3, which has been done. I still cannot get this code to run in 64-bit mode with Torque. What I can do is run the job in 64-bit mode using a hostfile.
Specifically, if I use

qsub -I -l nodes=2:ppn=1

Torque allocates two nodes to the job and, since this is an interactive shell, logs me in to the controlling node. In this example process rank 0 is n72 and process rank 1 is n89:

[sims@n72 4000]$ mpirun --display-allocation -pernode --display-map hostname

======================   ALLOCATED NODES   ======================

Data for node: Name: n72.clust.nist.gov Num slots: 1 Max slots: 0
Data for node: Name: n89       Num slots: 1    Max slots: 0

=================================================================

========================   JOB MAP   ========================

Data for node: Name: n72.clust.nist.gov        Num procs: 1
       Process OMPI jobid: [47657,1] Process rank: 0

Data for node: Name: n89       Num procs: 1
       Process OMPI jobid: [47657,1] Process rank: 1

=============================================================
n89
n72.clust.nist.gov

My hostfile is
[sims@n72 4000]$ cat hostfile
n72
n89


If, logged in to n72, I use the command
mpirun -np 2 ./MPI_li_64
the job fails with:
mpirun noticed that process rank 1 with PID 10538 on node n89 exited on signal 11 (Segmentation fault).

If I use the command
mpirun -np 2 --hostfile hostfile ./MPI_li_64
the same thing happens.

However, if I ssh to n73, for example, and use the command
mpirun -np 2 --hostfile hostfile ./MPI_li_64
everything works fine. So it appears that the problem is with torque.

Any ideas?