Hi Jeff, thanks for the response.
As soon as I can afford to reboot my workstation,
probably tomorrow, I will test, as you suggest, whether the computer
actually hangs or just slows down. For exhaustive kernel logging,
I replaced the following line
kern.*               -/var/log/kern.log
with
kern.*                /var/log/kern.log
in my /etc/rsyslog.d/50-default.conf file. Does that look about right?

Regards,

Olivier Marsden

Jeff Squyres wrote:
On Jul 7, 2010, at 10:20 AM, Olivier Marsden wrote:

The (7 process) code runs correctly on my workstation using mpich2 (latest
stable version) & ifort 11.1, and using intel-mpi & ifort 11.1, but randomly
hangs the computer (vanilla ubuntu 9.10, kernel v. 2.6.31) to the point where
only a magic sysrq combination can "save" me (i.e. reboot), when using
- openmpi 1.4.2 compiled from source with gcc, ifort for mpif90
- clustertools v. 8.2.1c distribution from sun/oracle, also based on
  openmpi 1.4.2, using sun f90 for mpif90

Yowza.  Open MPI is user space code, so it should never be able to hang the 
entire computer.  Open MPI and MPICH2 do implement things in very different 
ways, so it's quite possible that we trip entirely different code paths in the 
same linux kernel.

Never say "never" -- it could well be an Open MPI bug.  But it smells like a 
kernel bug...

I am prepared to do some testing if that can help, but don't know the
best way to identify what's going on.
I have found no useful information in the syslog files.

Is the machine totally hung?  Or is it just running really, really slowly?  Try leaving 
some kind of lightweight monitoring process running in the background and see if it keeps 
running (perhaps more slowly than before) when the machine hangs.  E.g., something 
like a shell script that loops, sleeping for a second and then appending the output 
of "date" to a file.
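
A minimal sketch of such a heartbeat script, assuming a POSIX shell.  The 
function name "heartbeat", its arguments, and the optional iteration count are 
illustrative choices only and have nothing to do with Open MPI itself:

```shell
#!/bin/sh
# Append a timestamp to a log file at a fixed interval. Gaps between
# consecutive timestamps show when (and for how long) the machine stalled.
# All names below are illustrative assumptions.

# usage: heartbeat LOGFILE [INTERVAL_SECONDS] [COUNT]
# COUNT=0 (the default) means "loop until killed".
heartbeat() {
    logfile=$1
    interval=${2:-1}
    count=${3:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date >> "$logfile"      # one line per tick
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Started in the background before the MPI job (for example, "heartbeat 
heartbeat.log 1 &"), the log can be inspected after the event: evenly spaced 
timestamps up to the reboot suggest a true hang, while timestamps that keep 
arriving at long, irregular intervals suggest the machine was only starved of 
cycles.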

My point: see if Open MPI went into some hyper-aggressive mode where it's (literally) 
stealing every available cycle and making the machine look hung.  You might even want to 
try running the OMPI procs at a low priority to see if it can help alleviate the 
"steal all cycles" mode (if that is, indeed, what is happening).

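One way to try the low-priority run is to start the job under "nice".  This is 
only a sketch: the helper name "run_low_priority" and the executable name 
"./my_app" are placeholders, and it assumes the MPI processes are launched 
locally as children of mpirun, so that they inherit its niceness:

```shell
#!/bin/sh
# Run an arbitrary command at the lowest scheduling priority (niceness 19).
# run_low_priority is an illustrative helper name, not an Open MPI tool.
run_low_priority() {
    nice -n 19 "$@"
}

# Example invocation (placeholder executable name, 7 processes as in the
# original report):
#   run_low_priority mpirun -np 7 ./my_app
```

If the desktop stays responsive with the job niced, that points towards the 
"steal all cycles" explanation rather than a genuine kernel hang.
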
If the machine is truly hung, then something else might be going on.  Do any 
kernel logs report anything?  Can you crank up your syslog to report *all* 
events, for example?

