Ralph,
     Thanks for the quick response; clarifications below.
      Sean

________________________________

From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 6/11/2007 3:49 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] mpirun hanging when processes started on head node


Hi Sean

Could you please clarify something? I'm a little confused by your comments 
about where things are running. I'm assuming that you mean everything works 
fine if you type the mpirun command on the head node and just let it launch on 
your compute nodes - that the problems only occur when you specifically tell 
mpirun you want processes on the head node as well (or exclusively). Is that 
correct?

[Sean] This is correct.


There are several possible sources of trouble, if I have understood your 
situation correctly. Our bproc support is somewhat limited at the moment - you 
may be encountering one of those limits. We currently have bproc support 
focused on the configuration here at Los Alamos National Lab as (a) that is 
where the bproc-related developers are working, and (b) it is the only regular 
test environment we have to work with for bproc. We don't normally use bproc in 
combination with hostfiles, so I'm not sure if there is a problem in that 
combination. I can investigate that a little later this week.

[Sean] If it is helpful, running 'export NODES=-1; mpirun -np 1 hostname' 
exhibits identical behaviour.
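
For reference, a minimal sketch of the comparison we have been making on our 
Scyld/bproc setup (assuming the usual bproc numbering where -1 is the 
master/head node and 0 is the first compute node; adjust for your cluster):

    # head node only -- 'hostname' runs, but mpirun never returns
    export NODES=-1
    mpirun -np 1 hostname

    # a compute node -- launch and teardown both complete normally
    export NODES=0
    mpirun -np 1 hostname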

Similarly, we require that all the nodes being used must be accessible via the 
same launch environment. It sounds like we may be able to launch processes on 
your head node via rsh, but not necessarily bproc. You might check to ensure 
that the head node will allow bproc-based process launch (I know ours don't - 
all jobs are run solely on the compute nodes. I believe that is generally the 
case). We don't currently support mixed environments, and I honestly don't 
expect that to change anytime soon.


[Sean] I'm working through the strace output to follow the progression on the 
head node. It looks like mpirun consults '/bpfs/self', determines that the 
request is to be run on the local machine, and then fork/execs 'orted', which in 
turn runs 'hostname'. 'mpirun' did not consult '/bpfs' or use 'rsh' after 
deciding to run on the local machine. When the 'hostname' command completes, 
'orted' receives SIGCHLD, performs some work, and then both 'mpirun' and 'orted' 
settle into what appears to be a poll() waiting for events.
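
As a quick cross-check on where things are stuck, this is roughly the kind of 
thing I have been looking at while the job hangs (the pids are placeholders; 
substitute the ones ps reports):

    # find the surviving processes
    ps -ef | grep -E '[m]pirun|[o]rted'

    # both show a poll/select-style wait channel
    cat /proc/<mpirun_pid>/wchan
    cat /proc/<orted_pid>/wchan

    # or attach strace to both and watch for further activity (there is none)
    strace -p <mpirun_pid> -p <orted_pid>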


Hope that helps at least a little.

[Sean] I appreciate the help. We are running processes on the head node because 
the head node is the only node which can access external resources (storage 
devices). 


Ralph





On 6/11/07 1:04 PM, "Kelley, Sean" <sean.kel...@solers.com> wrote:



        I forgot to add that we are using 'bproc'. Launching processes on the 
compute nodes using bproc works well; I'm not sure whether bproc is involved 
when processes are launched on the local node.
        
        Sean
        
        
________________________________

        From: users-boun...@open-mpi.org on behalf of Kelley, Sean
        Sent: Mon 6/11/2007 2:07 PM
        To: us...@open-mpi.org
        Subject: [OMPI users] mpirun hanging when processes started on head node
        
        Hi,
              We are running the OFED 1.2rc4 distribution containing 
openmpi-1.2.2 on a Red Hat EL4 U4 system with Scyld Clusterware 4.1. The 
hardware configuration consists of a Dell 2950 as the head node and 3 Dell 1950 
blades as compute nodes, using Cisco TopSpin InfiniBand HCAs and switches for 
the interconnect.
        
              When we use 'mpirun' from the OFED/Open MPI distribution to start 
processes on the compute nodes, everything works correctly. However, when we 
try to start processes on the head node, the processes appear to run correctly 
but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' 
file contains detailed information from running the following command:
        
             mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname
        
        where 'hostfile1' contains the following:
        
        -1 slots=2 max_slots=2
        
        The 'run.log' is the output of the above line. The 'strace.out.0' is 
the result of 'strace -f' on the mpirun process (and the 'hostname' child 
process since mpirun simply forks the local processes). The child process (pid 
23415 in this case) runs to completion and exits successfully. The parent 
process (mpirun) doesn't appear to recognize that the child has completed and 
hangs until killed (with a ^c). 
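
        As a rough sketch of the check (process names only; substitute the real 
pids if preferred), the state of the processes while mpirun is hung can be seen 
with:

            ps -o pid,ppid,stat,wchan,cmd -C mpirun,orted,hostname

        which shows 'hostname' already gone while mpirun is still waiting.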
        
        Additionally, when we run a set of processes that spans the head node 
and the compute nodes, the processes on the head node complete successfully, 
but the processes on the compute nodes do not appear to start. mpirun again 
appears to hang.
        
        Do I have a configuration error, or have I run into a bug? Thank you in 
advance for your assistance or suggestions.
        
        Sean
        
        ------
        Sean M. Kelley
        sean.kel...@solers.com
        
         
        
        

