On Nov 7, 2007, at 4:41 PM, Benjamin, Ted G. wrote:
> Please understand that I'm decent at the engineering side of it; as
> a system administrator, I'm a decent engineer.
> On the previous configurations, this program seems to run with any
> number of processors. I believe these successful users have been
> using LAM/MPI. While I was waiting for a reply, I installed LAM/MPI.
> The results were similar to those from Open MPI.
This is a good sign; consistent behavior across multiple different
MPIs implies a problem at the application or system level (i.e., not
the MPI level). Again, I'll not promise that any MPI is bug free, but
these signs point to an application/system problem.
> While I can choose LAM/MPI, I'd prefer to port it to Open MPI since
> that is where all the development and most of the support are.
Good.
> I cannot choose the Portland compiler. I must use either GNU or
> Intel compilers on the Itanium2.
Ok.
>> Have you tried running your code through a memory-checking debugger,
>> and/or examining any corefiles that were generated to see if there
>> is a problem in your code?
>>
>> I will certainly not guarantee that Open MPI is bug free, but
>> problems like this are *usually* application-level issues. One place
>> I always start is running the application in a debugger to see if
>> you can catch exactly where the Badness happens. This can be most
>> helpful.
> I have tried to run a debugger, but I am not an expert at it. I
> could not get Intel's idb debugger to give me a prompt, but I could
> get a prompt from gdb. I've looked over the manual, but I'm not
> sure how to put in the breakpoints et al. that you geniuses use to
> evaluate a program at critical junctures. I actually used an
> "mpirun -np 2 gdb" command to run it on 2 CPUs. I attached the file
> at the prompt. When I did a run, it ran fine with no optimization
> and one processor. With 2 processors, it didn't seem to do
> anything. All I will say here is that I have a lot to learn. I'm
> calling on my friends for help on this.
For such small runs, I typically do the lazy thing (sketched below):
- mpirun -np 2 ... as normal
- login to the node(s) where the jobs were launched
- use "gdb --pid <pid>" to attach to each of the jobs
- when gdb attaches, use the "continue" command to let the jobs keep
running
- eventually, the problem will occur and the process will die
- in several kinds of scenarios, gdb will show you right where it died
Consult the gdb documentation and/or any local resources you have for
more details.
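
For example, here is roughly what that workflow looks like (the
executable name my_app and the PID are made up for illustration;
substitute your own):

  # terminal 1: launch the job as usual
  mpirun -np 2 ./my_app

  # terminal 2: find the PIDs of the running processes
  ps -C my_app -o pid,cmd

  # attach a gdb to each PID, then let it keep running
  gdb --pid 12345
  (gdb) continue
  # ...when the process dies, gdb stops and you can ask where:
  (gdb) backtrace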
>> That's fun. Can you tell if it runs the app at all, or if it dies
>> before main() starts? This is probably more of an issue for your
>> Intel support guy than us...
> It's a Fortran program. It starts in the main program. I inserted
> some PRINT*, statements of the "PRINT*,'Read the input at line 213'"
> variety into the main program to see what would print. It printed
> the first four statements, but it didn't reach the last three. The
> calls that were reached were in the set-up section of the program.
> The section that wasn't reached had a lot of matrix-setting and
> solving subroutine calls.
That's also a good sign; it started to execute and then died later.
So it's not a system-level issue that prevents the app from starting;
that eliminates one whole line of troubleshooting.
> Both "mpirun -np 2 mpi_hello" and "mpirun -np 2 non_mpi_hello" print
> two "Hello, world"s.
So just to be absolutely clear: this is expected behavior. Open MPI's
mpirun can launch non-MPI applications.
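
For instance, mpirun will happily run N copies of an ordinary non-MPI
executable such as hostname:

  # launches two independent copies; prints the hostname twice
  mpirun -np 2 hostname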
> This is my mistake. I attached an old version of ompi_info.txt. I
> am now attaching the correct version. I already have 1.2.4 installed.
Gotcha. I would proceed with seeing what the debugger will tell you,
or, failing that, putting more and more printf's in to narrow down
exactly where things fail. I'm an advocate of using tools, though --
so I tend to prefer using debuggers. But sometimes a small number of
printf's are ok. :-)
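
Related to the earlier memory-checking suggestion: since you're
limited to the GNU and Intel compilers, their built-in runtime bounds
checking is a cheap first pass. A sketch, assuming Open MPI's mpif90
wrapper and a stand-in source file my_app.f90:

  # GNU Fortran: enable runtime array-bounds checking
  mpif90 -g -fbounds-check -o my_app my_app.f90

  # Intel Fortran: equivalent runtime checks
  mpif90 -g -check bounds -o my_app my_app.f90

Memory corruption from an out-of-bounds array write can easily show up
only in the multi-process case, so it's worth ruling out.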
Good luck.
--
Jeff Squyres
Cisco Systems