On Nov 5, 2007, at 4:12 PM, Benjamin, Ted G. wrote:
I have a code that runs with both Portland and Intel compilers
on X86, AMD64 and Intel EM64T running various flavors of Linux on
clusters. I am trying to port it to a 2-CPU Itanium2 (ia64) running
Red Hat Enterprise Linux 4.0; it has gcc 3.4.6-8 and the Intel
Fortran compiler 10.0.026 installed. I have built Open MPI 1.2.4
using these compilers.
When I built Open MPI, I didn’t do anything special; I enabled
debug, but that was really all. Of course, you can see that in the
attached config file.
This system is not part of a cluster; the two onboard CPUs of the
HP zx6000 are the only processors on which the job runs. The code
must run under MPI because the source calls MPI routines. I compiled
the target software with the Fortran 90 wrapper compiler (mpif90).
I’ve been running the code in the foreground so that I could
keep an eye on its behavior.
When I try to run the compiled and linked code [mpirun -np #
{executable file}], it performs as shown below:
(1) With the source compiled at optimization -O0 and -np 1, the job
runs very slowly (6 days on the wall clock) to the correct answer on
the benchmark;
(2) With the source compiled at optimization -O0 and -np 2, the
benchmark job fails with a segmentation violation;
Have you tried running your code through a memory-checking debugger,
and/or examining any corefiles that were generated to see if there is
a problem in your code?
I will certainly not guarantee that Open MPI is bug free, but problems
like this are *usually* application-level issues. One place I always
start is running the application in a debugger to see if you can catch
exactly where the Badness happens. This can be most helpful.
(3) With the source compiled at all other optimization levels (-O1,
-O2, -O3) and processor counts (-np 1 and -np 2), it fails in what I
would call a “quiescent” manner. What I mean by this is that it
does not produce any error messages: when I submit the job, it
produces a little standard output and quits after 2-3 seconds.
That's fun. Can you tell if it runs the app at all, or if it dies
before main() starts? This is probably more of an issue for your
Intel support guy than us...
In an attempt to find the problem, the technical support agent
at Intel has had me run some simple “Hello” test programs.
The first one is an MPI hello code that is the attached
hello_mpi.f. This ran as expected, and it echoed one “Hello” for
each of the two processors.
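The attachment is not reproduced in the message body; a minimal MPI
hello along the lines described, assuming the standard mpif.h interface
and fixed-form Fortran (the actual hello_mpi.f may differ), would be:

c     Minimal MPI hello sketch: each rank reports its rank and the
c     size of MPI_COMM_WORLD, so "mpirun -np 2" prints two lines.
      program hello_mpi
      implicit none
      include 'mpif.h'
      integer ierr, rank, nprocs
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'Hello from process ', rank, ' of ', nprocs
      call MPI_FINALIZE(ierr)
      end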
The second one is a non-MPI hello that is the attached
hello.f90. Since it is a non-MPI source, I was told that running it
on a workstation with a properly configured MPI should only echo one
“Hello”; the Intel agent told me that two such echoes indicate a
problem with Open MPI. It echoed twice, so now I have come to you
for help.
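Again, the attachment is not shown; a non-MPI hello.f90 of the sort
described is just a plain Fortran program with no MPI calls, roughly:

! Non-MPI hello sketch: contains no MPI calls, so any duplicated
! output comes purely from the launcher starting multiple copies.
program hello
  implicit none
  print *, 'Hello'
end program hello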
I'm not sure what you mean by that. If you:
mpirun -np 4 hostname
where "hostname" is non-MPI program (e.g., /bin/hostname), you'll
still see the output 4 times because you told MPI to run 4 copies of
"hostname". In this way, Open MPI is just being used as a job launcher.
So if I'm understanding you right,
mpirun -np 2 my_non_mpi_f90_hello_app
should still print 2 copies of "hello". If it does, then Open MPI is
doing exactly what it should do.
Specifically: Open MPI's mpirun can be used to launch non-MPI
applications (the same is not necessarily true for other MPI
implementations).
The other three attached files are the output requested on the
“Getting Help” page: (1) the output of /sbin/ifconfig, (2) the
output of ompi_info --all, and (3) the config.log file.
The installation of Open MPI itself was as easy as could be.
I am really ignorant of how it works beyond what I’ve read in the
FAQs and learned in a little digging, so I hope it’s a simple
solution.
FWIW, I see that you're using Open MPI v1.2. Our latest version is
v1.2.4; if possible, you might want to try and upgrade (e.g., delete
your prior installation, recompile/reinstall Open MPI, and then
recompile/relink your application against the new Open MPI
installation); it has all of our latest bug fixes, etc.
--
Jeff Squyres
Cisco Systems