Are there multiple interfaces on your nodes? I'm wondering if we are using a 
different network than the one where you opened these ports.
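
If they do, you can pin Open MPI to the network you opened up (eth0 is 
just an example interface name here):

    mpirun -mca oob_tcp_if_include eth0 -mca btl_tcp_if_include eth0 ...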

You can turn on debug output in the oob itself with -mca oob_tcp_verbose xx. 
The higher the number, the more output you get - and you'll get quite a bit.
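
For example (the level itself is arbitrary - double digits is already 
quite chatty):

    mpirun -mca oob_tcp_verbose 20 -hostfile hostfile -np 16 hello_c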


On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:

> Hello again,
> 
> I believe my administrator has opened the ports I requested. The problem I am 
> having now is that OpenMPI is not listening on the port assignments I defined 
> in openmpi-mca-params.conf (the file has permission 644 - should it be 755?).
> 
> When I run netstat -ltnup I see orted listening on 14 TCP ports, but they are 
> scattered around the 26000 range even though I specified 60001-60016 in the 
> mca-params file. Is there a parameter I am missing? In any case, I am still 
> hanging as originally described, even with the port forwarding enabled and 
> the settings in mca-params in place. 
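> 
> For what it's worth, I believe ompi_info should show whether the values 
> are being picked up at all - something like:
> 
>     ompi_info --param oob tcp | grep port
>     ompi_info --param btl tcp | grep port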
> 
> Any other ideas on what might be causing the hang? Is there a more verbose 
> mode I can employ to see more deeply into the issue? I have run 
> --debug-daemons and --mca plm_base_verbose 99.
> 
> Thanks!
> --- On Tue, 7/6/10, Robert Walters <raw19...@yahoo.com> wrote:
> 
> From: Robert Walters <raw19...@yahoo.com>
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Tuesday, July 6, 2010, 5:41 PM
> 
> Thanks for your expeditious responses, Ralph.
> 
> Just to confirm with you, I should change openmpi-mca-params.conf to include:
> 
> oob_tcp_port_min_v4 = (My minimum port in the range)
> oob_tcp_port_range_v4 = (My port range)
> btl_tcp_port_min_v4 = (My minimum port in the range)
> btl_tcp_port_range_v4 = (My port range)
> 
> correct?
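> 
> In other words, for a range like 60001-60016, something like this 
> (assuming the range parameter counts ports upward from the minimum - 
> please correct me if I've misread it):
> 
>     oob_tcp_port_min_v4 = 60001
>     oob_tcp_port_range_v4 = 16
>     btl_tcp_port_min_v4 = 60001
>     btl_tcp_port_range_v4 = 16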
> 
> Also, for a cluster of around 32-64 processes (8 processors per node), how 
> wide a range will I need? I've noticed some entries in the mailing list 
> suggesting you only need a few ports to get started and more are opened as 
> necessary. Will I be safe with 20, or should I go for 100? 
> 
> Thanks again for all of your help!
> 
> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
> 
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Tuesday, July 6, 2010, 5:31 PM
> 
> Problem isn't with ssh - the problem is that the daemons need to open a TCP 
> connection back to the machine where mpirun is running. If the firewall 
> blocks that connection, then we can't run.
> 
> If you can get a range of ports opened, then you can specify the ports OMPI 
> should use for this purpose. If the sysadmin won't allow even that, then you 
> are pretty well hosed.
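> 
> On RHEL that is typically an iptables rule along these lines (the port 
> numbers are just an example - use whatever range your sysadmin grants), 
> inserted ahead of the default REJECT rule:
> 
>     iptables -I INPUT -p tcp --dport 60001:60016 -j ACCEPT
>     service iptables save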
> 
> 
> On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
> 
>> Yes, there is a system firewall. I don't think the sysadmin will allow it to 
>> be disabled. Each Linux machine has the built-in RHEL firewall, although SSH 
>> is enabled through it.
>> 
>> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Date: Tuesday, July 6, 2010, 4:19 PM
>> 
>> It looks like the remote daemon is starting - is there a firewall in the way?
>> 
>> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
>> 
>>> Hello all,
>>> 
>>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons, and 
>>> right now I am just working on getting OpenMPI itself up and running. I 
>>> have a successful configure and make all install. The LD_LIBRARY_PATH and 
>>> PATH variables were correctly edited. mpirun -np 8 hello_c works on all 
>>> machines. I have set up my two test machines with DSA key pairs that 
>>> work with each other.
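>>> 
>>> (That was the standard procedure, roughly - with the hostname as a 
>>> placeholder:
>>> 
>>>     ssh-keygen -t dsa
>>>     ssh-copy-id myuser@machine2
>>> 
>>> and the same from machine2 back, so passwordless ssh works both ways.)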
>>> 
>>> The problem comes when I use my hostfile to try to communicate across 
>>> machines. The hostfile is set up correctly with <host_name> <slots> 
>>> <max-slots>.
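>>> 
>>> Concretely, it looks like this (hostnames changed):
>>> 
>>>     machine1 slots=8 max-slots=8
>>>     machine2 slots=8 max-slots=8
>>> 
>>> When running with all verbose options enabled "mpirun --mca 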
>>> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
>>> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
>>> hello_c" I receive the following text output.
>>> 
>>> [machine1:03578] mca: base: components_open: Looking for plm components
>>> [machine1:03578] mca: base: components_open: opening plm components
>>> [machine1:03578] mca: base: components_open: found loaded component rsh
>>> [machine1:03578] mca: base: components_open: component rsh has no register 
>>> function
>>> [machine1:03578] mca: base: components_open: component rsh open function 
>>> successful
>>> [machine1:03578] mca: base: components_open: found loaded component slurm
>>> [machine1:03578] mca: base: components_open: component slurm has no 
>>> register function
>>> [machine1:03578] mca: base: components_open: component slurm open function 
>>> successful
>>> [machine1:03578] mca:base:select: Auto-selecting plm components
>>> [machine1:03578] mca:base:select:(  plm) Querying component [rsh]
>>> [machine1:03578] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [machine1:03578] mca:base:select:(  plm) Querying component [slurm]
>>> [machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
>>> failed to return a module
>>> [machine1:03578] mca:base:select:(  plm) Selected component [rsh]
>>> [machine1:03578] mca: base: close: component slurm closed
>>> [machine1:03578] mca: base: close: unloading component slurm
>>> [machine1:03578] mca: base: components_open: Looking for oob components
>>> [machine1:03578] mca: base: components_open: opening oob components
>>> [machine1:03578] mca: base: components_open: found loaded component tcp
>>> [machine1:03578] mca: base: components_open: component tcp has no register 
>>> function
>>> [machine1:03578] mca: base: components_open: component tcp open function 
>>> successful
>>> Daemon was launched on machine2- beginning to initialize
>>> [machine2:01962] mca: base: components_open: Looking for oob components
>>> [machine2:01962] mca: base: components_open: opening oob components
>>> [machine2:01962] mca: base: components_open: found loaded component tcp
>>> [machine2:01962] mca: base: components_open: component tcp has no register 
>>> function
>>> [machine2:01962] mca: base: components_open: component tcp open function 
>>> successful
>>> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
>>> Daemon [[1418,0],1] not using static ports
>>> 
>>> At this point the system hangs indefinitely. While running top in the 
>>> machine2 terminal, I briefly see several processes come up: sshd (root), 
>>> tcsh (myuser), orted (myuser), and mcstransd (root). Does sshd need to be 
>>> initiated by myuser? That is currently turned off in sshd_config through 
>>> UsePAM yes. This was set up by the sysadmin, but it can be worked around 
>>> if necessary.
>>> 
>>> So in summary: mpirun works on each machine individually but hangs when 
>>> launched through a hostfile or with the -host flag. ./configure was run 
>>> with defaults plus --prefix, and LD_LIBRARY_PATH and PATH are set up 
>>> correctly. Any help is appreciated. Thanks!