On Nov 7, 2007, at 4:41 PM, Benjamin, Ted G. wrote:

Please understand that I’m decent at the engineering side of this; as a system administrator, I’m a decent engineer.

On the previous configurations, this program seems to run with any number of processors. I believe these successful users have been using LAM/MPI. While I was waiting for a reply, I installed LAM/MPI. The results were similar to those from OpenMPI.


This is a good sign; consistent behavior across different MPI implementations implies a problem at the application or system level (i.e., not at the MPI level). Again, I won't promise that any MPI is bug free, but these signs point to an application/system problem.

While I can choose LAM/MPI, I’d prefer to port it to OpenMPI since that is where all the development and most of the support are.


Good.

I cannot choose the Portland compiler. I must use either GNU or Intel compilers on the Itanium2.


Ok.

> Have you tried running your code through a memory-checking debugger,
> and/or examining any corefiles that were generated to see if there is
> a problem in your code?
>
> I will certainly not guarantee that Open MPI is bug free, but problems
> like this are *usually* application-level issues. One place I always
> start is running the application in a debugger to see if you can catch
> exactly where the Badness happens. This can be most helpful.

I have tried to run a debugger, but I am not an expert at it. I could not get Intel’s idb debugger to give me a prompt, but I could get a prompt from gdb. I’ve looked over the manual, but I’m not sure how to put in the breakpoints et al. that you geniuses use to evaluate a program at critical junctures. I actually used an “mpirun -np 2 gdb” command to run it on 2 CPUs, and attached the file at the prompt. When I did a run, it ran fine with no optimization and one processor. With 2 processors, it didn’t seem to do anything. All I will say here is that I have a lot to learn. I’m calling on my friends for help on this.


For such small runs, I typically do the lazy thing:

- mpirun -np 2 ... as normal
- login to the node(s) where the jobs were launched
- use "gdb --pid <pid>" to attach to each of the jobs
- when gdb attaches, use the "continue" command to let the jobs keep running
- eventually, the problem will occur and the process will die
- in many cases, gdb will show you exactly where it died

Consult the gdb documentation and/or any local resources you have for more details.
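
Concretely, the sequence looks something like this (the hostname, program name, and PID below are made up; use whatever mpirun, your scheduler, and "ps" actually report on your system):

     # terminal 1: launch as usual
     mpirun -np 2 ./my_app

     # terminal 2: log in to the node and find the PIDs of the ranks
     ssh node01
     ps -ef | grep my_app

     # attach to one PID (repeat in other terminals for the other ranks)
     gdb --pid 12345
     (gdb) continue
     # ...wait for the Badness, then look at the stack
     (gdb) backtrace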

> That's fun. Can you tell if it runs the app at all, or if it dies before
> main() starts? This is probably more of an issue for your
> intel support guy than us...

It’s a Fortran program, and it starts in the main program. I inserted some PRINT* statements of the “PRINT *,’Read the input at line 213’ ” variety into the main program to see what would print. It printed the first four statements, but it didn’t reach the last three. The statements that were reached were in the set-up section of the program; the section that wasn’t reached had a lot of matrix-setup and solver subroutine calls.


That's also a good sign; it started to execute and then died later. So it's not a system-level issue that prevents the app from starting; that eliminates one whole line of troubleshooting.

Both

     mpirun -np 2 mpi_hello

and

     mpirun -np 2 non_mpi_hello

print two “Hello, world”s.


So just to be absolutely clear: this is expected behavior. Open MPI's mpirun can launch non-MPI applications; it simply starts N copies of whatever executable you give it, so a serial "hello" run with -np 2 prints its message twice.
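
For reference, an MPI hello world like the "mpi_hello" you ran presumably looks something like this (just a sketch of a typical test program; I'm guessing at your source, not quoting it):

     program mpi_hello
       implicit none
       include 'mpif.h'
       integer :: ierr, rank, nprocs
       ! each copy that mpirun launches calls MPI_INIT and learns its rank
       call MPI_INIT(ierr)
       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
       call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
       print *, 'Hello, world from rank', rank, 'of', nprocs
       call MPI_FINALIZE(ierr)
     end program mpi_hello

The non-MPI version would be the same thing with all the MPI calls removed; mpirun still starts two copies of it, which is why you see two "Hello, world"s either way.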

This is my mistake. I attached an old version of ompi_info.txt. I am now attaching the correct version. I already have 1.2.4 installed.



Gotcha. I would proceed with seeing what the debugger will tell you, or, failing that, putting more and more printf's in to narrow down exactly where things fail. I'm an advocate of using tools, though -- so I tend to prefer using debuggers. But sometimes a small number of printf's are ok. :-)
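
If you do stay with prints for a while, one tip: tag each message with the MPI rank and flush it immediately, so you can tell which process got where before it died. A minimal sketch (the subroutine is my own invention; "call flush(6)" is a common extension supported by gfortran and ifort, not standard Fortran 95):

     subroutine checkpoint(rank, msg)
       ! print a rank-tagged progress marker and flush it right away,
       ! so the message isn't lost in an output buffer if the process dies
       implicit none
       integer, intent(in) :: rank
       character(len=*), intent(in) :: msg
       print *, 'rank', rank, ': ', msg
       call flush(6)
     end subroutine checkpoint

Then sprinkle calls like "call checkpoint(rank, 'entering solver')" through the section that never gets reached and bisect your way to the failing statement.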

Good luck.

--
Jeff Squyres
Cisco Systems

