Back to this problem.
The last suggestion was to upgrade to 1.3.3, which has been done. I still cannot get this code to run in 64-bit mode under Torque. What I can do is run the job in 64-bit mode using a hostfile.
Specifically, if I use
qsub -I -l nodes=2:ppn=1
Torque allocates two nodes to the job and, since this is an interactive shell, logs me in to the controlling node. In this example process rank 0 is n72 and process rank 1 is n89:
[sims@n72 4000]$ mpirun --display-allocation -pernode --display-map hostname
====================== ALLOCATED NODES ======================
Data for node: Name: n72.clust.nist.gov Num slots: 1 Max slots: 0
Data for node: Name: n89 Num slots: 1 Max slots: 0
=================================================================
======================== JOB MAP ========================
Data for node: Name: n72.clust.nist.gov Num procs: 1
Process OMPI jobid: [47657,1] Process rank: 0
Data for node: Name: n89 Num procs: 1
Process OMPI jobid: [47657,1] Process rank: 1
=============================================================
n89
n72.clust.nist.gov
My hostfile is:
[sims@n72 4000]$ cat hostfile
n72
n89
If, logged in to n72, I use the command
mpirun -np 2 ./MPI_li_64
the job fails with a segmentation fault:
mpirun noticed that process rank 1 with PID 10538 on node n89 exited on signal 11 (Segmentation fault).
If I use the command
mpirun -np 2 --hostfile hostfile ./MPI_li_64
the same thing happens.
However, if I ssh to n73, for example, and use the command
mpirun -np 2 --hostfile hostfile ./MPI_li_64
everything works fine. So it appears that the problem is with Torque.
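In case it helps to separate the launcher from the application, this is the kind of minimal 64-bit test case I have in mind (a sketch only; mpi_hello.c is an illustrative name, not the actual MPI_li_64 source):

/* mpi_hello.c -- hypothetical minimal test case, not the real MPI_li_64 source.
 * Build with something like: mpicc -m64 -o mpi_hello mpi_hello.c
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Each rank reports where it landed, so a bad launch shows up right away. */
    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Running something like this with mpirun -np 2 inside the Torque allocation, and again with the hostfile from a node such as n73, should show whether the segfault depends on the launch environment at all or is specific to my code.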
Any ideas?