The problem isn't with ssh - the problem is that the daemons need to open a TCP 
connection back to the machine where mpirun is running. If the firewall blocks 
that connection, the job can't run.

If you can get a range of ports opened, then you can specify the ports OMPI 
should use for this purpose. If the sysadmin won't allow even that, then you 
are pretty well hosed.
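
For example, assuming the admin opens (say) TCP ports 10000-10099 on every 
node, you could restrict both the daemon wireup (oob) and the MPI traffic 
(btl) to that range. The parameter names below are from memory, so please 
verify them against your install with "ompi_info --param oob tcp" and 
"ompi_info --param btl tcp":

    mpirun --mca oob_tcp_port_min_v4 10000 --mca oob_tcp_port_range_v4 100 \
           --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 \
           -hostfile hostfile -np 16 hello_c

The sysadmin would then only need to allow that port range through the RHEL 
firewall (iptables) on each machine, rather than disabling the firewall 
entirely.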


On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:

> Yes, there is a system firewall. I don't think the sysadmin will allow it to 
> be disabled. Each Linux machine has the built-in RHEL firewall. SSH is 
> allowed through the firewall, though.
> 
> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
> 
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Tuesday, July 6, 2010, 4:19 PM
> 
> It looks like the remote daemon is starting - is there a firewall in the way?
> 
> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
> 
>> Hello all,
>> 
>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons and 
>> right now I am just working on getting OpenMPI itself up and running. I have 
>> a successful configure and make all install. LD_LIBRARY_PATH and PATH 
>> variables were correctly edited. mpirun -np 8 hello_c successfully works on 
>> all machines. I have setup my two test machines with DSA key pairs that 
>> successfully work with each other.
>> 
>> The problem comes when I use my hostfile to attempt to communicate 
>> across machines. The hostfile is set up correctly with <host_name> <slots> 
>> <max-slots>. When running with all verbose options enabled "mpirun --mca 
>> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
>> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
>> hello_c" I receive the following text output.
>> 
>> [machine1:03578] mca: base: components_open: Looking for plm components
>> [machine1:03578] mca: base: components_open: opening plm components
>> [machine1:03578] mca: base: components_open: found loaded component rsh
>> [machine1:03578] mca: base: components_open: component rsh has no register 
>> function
>> [machine1:03578] mca: base: components_open: component rsh open function 
>> successful
>> [machine1:03578] mca: base: components_open: found loaded component slurm
>> [machine1:03578] mca: base: components_open: component slurm has no register 
>> function
>> [machine1:03578] mca: base: components_open: component slurm open function 
>> successful
>> [machine1:03578] mca:base:select: Auto-selecting plm components
>> [machine1:03578] mca:base:select:(  plm) Querying component [rsh]
>> [machine1:03578] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [machine1:03578] mca:base:select:(  plm) Querying component [slurm]
>> [machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
>> failed to return a module
>> [machine1:03578] mca:base:select:(  plm) Selected component [rsh]
>> [machine1:03578] mca: base: close: component slurm closed
>> [machine1:03578] mca: base: close: unloading component slurm
>> [machine1:03578] mca: base: components_open: Looking for oob components
>> [machine1:03578] mca: base: components_open: opening oob components
>> [machine1:03578] mca: base: components_open: found loaded component tcp
>> [machine1:03578] mca: base: components_open: component tcp has no register 
>> function
>> [machine1:03578] mca: base: components_open: component tcp open function 
>> successful
>> Daemon was launched on machine2- beginning to initialize
>> [machine2:01962] mca: base: components_open: Looking for oob components
>> [machine2:01962] mca: base: components_open: opening oob components
>> [machine2:01962] mca: base: components_open: found loaded component tcp
>> [machine2:01962] mca: base: components_open: component tcp has no register 
>> function
>> [machine2:01962] mca: base: components_open: component tcp open function 
>> successful
>> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
>> Daemon [[1418,0],1] not using static ports
>> 
>> At this point the system hangs indefinitely. While running top on the 
>> machine2 terminal, I see several things come up briefly. These items are: 
>> sshd (root), tcsh (myuser), orted (myuser), and mcstransd (root). I was 
>> wondering if sshd needs to be initiated by myuser? It is currently turned 
>> off in sshd_config through UsePAM yes. This was setup by the sysadmin but it 
>> can be worked around if this is necessary.
>> 
>> So in summary, mpirun works on each machine individually, but hangs when 
>> initiated through a hostfile or with the -host flag. ./configure with 
>> defaults and --prefix. LD_LIBRARY_PATH and PATH set up correctly. Any help 
>> is appreciated. Thanks!
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
