Hi Sean

Could you please clarify something? I'm a little confused by your comments about where things are running. I'm assuming you mean that everything works fine if you type the mpirun command on the head node and just let it launch on your compute nodes, and that the problems only occur when you specifically tell mpirun you want processes on the head node as well (or exclusively). Is that correct?
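Just to make sure we are describing the same two cases, here is roughly what I have in mind (the flags are real, but the process counts are only illustrative, and I'm assuming your default bproc allocation covers the compute nodes):

    # case 1: mpirun typed on the head node; processes launched only on
    # the compute nodes (the case that works for you, as I understand it)
    mpirun --np 4 hostname

    # case 2: a hostfile that explicitly places processes on the head node,
    # as your hostfile1 does with its '-1 slots=2' line (the case that hangs)
    mpirun --hostfile hostfile1 --np 4 hostname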
There are several possible sources of trouble, if I have understood your situation correctly. Our bproc support is somewhat limited at the moment, and you may be encountering one of those limits. We currently have bproc support focused on the configuration here at Los Alamos National Lab, as (a) that is where the bproc-related developers are working, and (b) it is the only regular test environment we have to work with for bproc.

We don't normally use bproc in combination with hostfiles, so I'm not sure if there is a problem in that combination. I can investigate that a little later this week.

Similarly, we require that all the nodes being used be accessible via the same launch environment. It sounds like we may be able to launch processes on your head node via rsh, but not necessarily via bproc. You might check to ensure that the head node will allow bproc-based process launch; see the quick check sketched at the end of this mail. (I know ours don't; all jobs are run solely on the compute nodes, and I believe that is generally the case.) We don't currently support mixed environments, and I honestly don't expect that to change anytime soon.

Hope that helps at least a little.

Ralph


On 6/11/07 1:04 PM, "Kelley, Sean" <sean.kel...@solers.com> wrote:

> I forgot to add that we are using 'bproc'. Launching processes on the compute
> nodes using bproc works well; I'm not sure whether bproc is involved when
> processes are launched on the local node.
>
> Sean
>
>
> From: users-boun...@open-mpi.org on behalf of Kelley, Sean
> Sent: Mon 6/11/2007 2:07 PM
> To: us...@open-mpi.org
> Subject: [OMPI users] mpirun hanging when processes started on head node
>
> Hi,
>
> We are running the OFED 1.2rc4 distribution containing openmpi-1.2.2 on a
> RedHat EL4U4 system with Scyld Clusterware 4.1. The hardware configuration
> consists of a DELL 2950 as the head node and 3 DELL 1950 blades as compute
> nodes, using Cisco TopSpin InfiniBand HCAs and switches for the interconnect.
>
> When we use 'mpirun' from the OFED/Open MPI distribution to start processes
> on the compute nodes, everything works correctly. However, when we try to
> start processes on the head node, the processes appear to run correctly but
> 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz'
> file contains detailed information from running the following command:
>
>     mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname
>
> where 'hostfile1' contains the following:
>
>     -1 slots=2 max_slots=2
>
> The 'run.log' is the output of the above command. The 'strace.out.0' file is
> the result of running 'strace -f' on the mpirun process (and on the
> 'hostname' child process, since mpirun simply forks the local processes).
> The child process (pid 23415 in this case) runs to completion and exits
> successfully. The parent process (mpirun) doesn't appear to recognize that
> the child has completed, and hangs until killed (with a ^C).
>
> Additionally, when we run a set of processes which span the head node and
> the compute nodes, the processes on the head node complete successfully, but
> the processes on the compute nodes do not appear to start. mpirun again
> appears to hang.
>
> Do I have a configuration error, or is there a problem that I have
> encountered? Thank you in advance for your assistance or suggestions.
>
> Sean
>
> ------
> Sean M. Kelley
> sean.kel...@solers.com
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
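P.S. The check I mentioned above: if you want to test the bproc side directly, the standard bproc tools should tell you quickly whether the head node accepts bproc launches. This is only a sketch; I'm assuming a stock Scyld install, and your node numbering may differ (the head node is usually node -1 in bproc numbering, which matches your hostfile):

    bpstat              # list bproc node states
    bpsh 0 hostname     # launch via bproc on compute node 0
    bpsh -1 hostname    # attempt the same launch on the head node

If that last command errors out or hangs, then bproc cannot launch processes on your head node, which would be consistent with the behavior you're seeing.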