Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Ashley Pittman
On Tue, 2009-05-19 at 14:01 -0400, Noam Bernstein wrote: I'm glad you got to the bottom of it. > With one of them, apparently, CP2K will silently go on if the file is missing, but then lock up in an MPI call (maybe it leaves some variables uninitialized, and then uses them in the call

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 12:13 PM, Ashley Pittman wrote: That is indeed odd but it shouldn't be too hard to track down, how often does the failure occur? Presumably when you say you have three invocations of the program they communicate via files, is the location of these files changing? Yeay.

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 12:13 PM, Ashley Pittman wrote: On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote: I'd suspect the filesystem too, except that it's hung up in an MPI call. As I said before, the whole thing is bizarre. It doesn't matter where the executable is, just what CWD is (i.

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Jeff Squyres
On May 19, 2009, at 12:13 PM, Ashley Pittman wrote: Finally if you could run it with "--mca btl ^ofed" to rule out the ofed stack causing the problem that would be useful. You'd need to check the syntax here. --mca btl ^openib We're stuck with that old name for now -- see http://www.open

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Ashley Pittman
On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote: > I'd suspect the filesystem too, except that it's hung up in an MPI call. As I said before, the whole thing is bizarre. It doesn't matter where the executable is, just what CWD is (i.e. I can do mpirun /scratch/exec or mpirun

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 9:32 AM, Ashley Pittman wrote: Can you confirm that *all* processes are in PMPI_Allreduce at some point, the collectives commonly get blamed for a lot of hangs and it's not always the correct place to look. For the openmpi run, every single process showed one of those two

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Jeff Squyres
On May 19, 2009, at 9:13 AM, Noam Bernstein wrote: The MPI code isn't calling fork or system. The serial code is calling system("mpirun cp2k.popt"). That runs to completion, processes the output files, and calls system("mpirun cp2k.popt") again, and so on. Is that in fact likely to be a problem
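
For readers following the thread, the driver pattern described above (a serial program that launches the MPI job, waits for it to finish, processes the output, and launches it again) can be sketched roughly as follows. This is only an illustration under stated assumptions: the thread does not say what language the serial code is written in, the function name `run_repeatedly` is hypothetical, and a stand-in command replaces `mpirun cp2k.popt` so the sketch runs without MPI installed.

```python
import subprocess
import sys


def run_repeatedly(cmd, steps):
    """Run cmd `steps` times in sequence, stopping at the first failure.

    Hypothetical helper, not taken from the thread. Returns the number of
    invocations that exited with status 0.
    """
    completed = 0
    for _ in range(steps):
        # Blocks until the child exits, much like system(); if the child
        # hangs (as reported in the thread), the driver waits here too.
        result = subprocess.run(cmd)
        if result.returncode != 0:
            break
        # Output files from this step would be processed at this point.
        completed += 1
    return completed


if __name__ == "__main__":
    # Stand-in for the real command from the thread: mpirun cp2k.popt
    stand_in = [sys.executable, "-c", "pass"]
    print(run_repeatedly(stand_in, 3))
```

Because each launch blocks until the child exits, a hang inside one MPI run stalls the whole driver, which matches the symptom reported in this thread.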

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Ashley Pittman
On Mon, 2009-05-18 at 17:05 -0400, Noam Bernstein wrote: > The code is complicated, the input files are big and lead to long computation times, so I don't think I'll be able to make a simple test case. Instead I attached to the hanging processes (all 8 of them) with gdb during the h

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Noam Bernstein
On May 19, 2009, at 8:29 AM, Jeff Squyres wrote: fork() support in OpenFabrics has always been dicey -- it can lead to random behavior like this. Supposedly it works in a specific set of circumstances, but I don't have a recent enough kernel on my machines to test. It's best not to use

Re: [OMPI users] CP2K mpi hang

2009-05-19 Thread Jeff Squyres
fork() support in OpenFabrics has always been dicey -- it can lead to random behavior like this. Supposedly it works in a specific set of circumstances, but I don't have a recent enough kernel on my machines to test. It's best not to use calls to system() if they can be avoided. Indeed,

[OMPI users] CP2K mpi hang

2009-05-18 Thread Noam Bernstein
Hi all - I have a bizarre OpenMPI hanging problem. I'm running an MPI code called CP2K (related to, but not the same as cpmd). The complications of the software aside, here are the observations: At the base is a serial code that uses system() calls to repeatedly invoke mpirun cp2k.pop