Okay - thanks!

First, be assured we run 64-bit ifort code under Torque at large scale all the time here at LANL, so this is likely to be something trivial in your environment.

A few things to consider/try:

1. The most likely culprit is that your LD_LIBRARY_PATH is pointing to the 32-bit libraries on the other nodes. Torque does -not- copy your environment by default, and neither does OMPI. Try adding

-x LD_LIBRARY_PATH

to your cmd line, making sure that the 64-bit libs are before any 32-bit libs in that envar. This tells mpirun to pick up that envar and propagate it for you.
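For example, something along these lines (the executable name is taken from your mail; adjust -np and the rest to match your run):

mpirun -np 2 -x LD_LIBRARY_PATH ./MPI_li_64

You can also hand mpirun an explicit value, e.g. -x LD_LIBRARY_PATH=<your 64-bit lib dir>:$LD_LIBRARY_PATH, if you want to be sure the 64-bit Intel runtime ends up at the front of the path on the remote nodes.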

2. Check to ensure you are in fact using a 64-bit version of OMPI. Run "ompi_info --config" to see how it was built. Also run "mpif90 --showme" and see what libs it is linked to. Does your LD_LIBRARY_PATH match the names and ordering?
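For example (the binary name comes from your mail; the output is whatever your install reports):

ompi_info --config
mpif90 --showme
ldd ./MPI_li_64

The --showme output is the underlying ifort command line, so you can see exactly which -L paths and MPI libs the wrapper uses; running ldd on the executable from a compute node then shows which libraries it actually resolves to at run time, and whether the Intel runtime libs (libifcore, libimf, etc.) come out of a 32-bit or 64-bit directory.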

3. Get a multi-node allocation and run "pbsdsh echo $LD_LIBRARY_PATH" and see what libs you are defaulting to on the other nodes.
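One caveat with that: the $LD_LIBRARY_PATH in the quoted command gets expanded by your local shell before pbsdsh ever runs, so to see what the remote nodes really have, protect it from local expansion, e.g.

pbsdsh sh -c 'echo $LD_LIBRARY_PATH'

(single quotes - assuming your pbsdsh will launch sh on each node, which it normally will since it just execs whatever program you give it on every node of the job).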

I realize these are somewhat overlapping, but I think you catch the drift - I suspect you are getting the infamous "library confusion" problem.

HTH
Ralph

On Jul 23, 2009, at 7:49 PM, Sims, James S. Dr. wrote:

[sims@raritan openmpi]$ mpirun -V
mpirun (Open MPI) 1.3.1rc4

________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
Sent: Thursday, July 23, 2009 5:44 PM
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI:Problem with 64-bit openMPI and intel compiler

What OMPI version are you using?

On Jul 23, 2009, at 3:00 PM, Sims, James S. Dr. wrote:

I have an MPI program compiled with a version of Open MPI that was built using the ifort 10.1 compiler. I can compile and run this code with no problem using the 32-bit version of ifort, and I can also submit batch jobs using Torque with this 32-bit code. However, compiling the same code to produce a 64-bit executable gives a binary that runs correctly only in the simplest cases. It does not run correctly under the Torque batch queuing system: it runs for a while and then gives a segmentation violation in a section of code that is fine in the 32-bit version.

I have to run the MPI multinode jobs using our Torque batch queuing system, but we do have the capability of running jobs in an interactive batch environment.

If I do a "qsub -I -l nodes=1:x4gb", I get an interactive session on the remote node assigned to my job. I can run the job using either "./MPI_li_64" or "mpirun -np 1 ./MPI_li_64", and the job runs successfully to completion. I can also start an interactive shell using "qsub -I -l nodes=1:ppn=2:x4gb" and I will get a single dual-processor (or larger) node. On this single node, "mpirun -np 2 ./MPI_li_64" works. However, if instead I ask for two nodes in my interactive batch session with "qsub -I -l nodes=2:x4gb", two nodes will be assigned to me, but when I enter "mpirun -np 2 ./MPI_li_64", the job runs a while and then fails with:

mpirun noticed that process rank 1 with PID 23104 on node n339
exited on signal 11 (Segmentation fault).

I can trace this in the Intel debugger and see that the segmentation fault is occurring in what should be good code, code that executes with no problem when everything is compiled 32-bit. I am at a loss for what could be preventing this code from running within the batch queuing environment in the 64-bit version.

Jim

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
