Are there multiple interfaces on your nodes? I'm wondering if we are using a different network than the one where you opened these ports.

You'll get quite a bit of output, but you can turn on debug output in the oob itself with -mca oob_tcp_verbose xx. The higher the number, the more you get.
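For example, something along these lines would pin both the OOB and the TCP BTL to the network where the ports were opened and turn up the OOB debug output at the same time (the interface name eth0 and the verbosity level are only placeholders for your setup; "ompi_info --param oob tcp" and "ompi_info --param btl tcp" will list the exact parameters available on your 1.4.2 install):

    mpirun --mca oob_tcp_if_include eth0 \
           --mca btl_tcp_if_include eth0 \
           --mca oob_tcp_verbose 100 \
           --debug-daemons -hostfile hostfile -np 16 hello_c

Comparing the addresses the daemons report against what /sbin/ifconfig (or "ip addr") shows on each node should make it clear whether the job is trying to use a different network than the one where the ports were opened.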
On Jul 10, 2010, at 11:14 AM, Robert Walters wrote:

> Hello again,
>
> I believe my administrator has opened the ports I requested. The problem I am having now is that OpenMPI is not using the port assignments I defined in openmpi-mca-params.conf (it looks like those files have permission 644; should they be 755?).
>
> When I run netstat -ltnup I see that orted has 14 processes listening on TCP, but they are scattered in the 26000-ish port range even though I specified 60001-60016 in the mca-params file. Is there a parameter I am missing? In any case, I am still hanging as mentioned originally, even with the port forwarding enabled and the settings in mca-params in place.
>
> Any other ideas on what might be causing the hang? Is there a more verbose mode I can employ to see more deeply into the issue? I have run --debug-daemons and --mca plm_base_verbose 99.
>
> Thanks!
>
> --- On Tue, 7/6/10, Robert Walters <raw19...@yahoo.com> wrote:
>
> From: Robert Walters <raw19...@yahoo.com>
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Tuesday, July 6, 2010, 5:41 PM
>
> Thanks for your expeditious responses, Ralph.
>
> Just to confirm with you, I should change openmpi-mca-params.conf to include:
>
> oob_tcp_port_min_v4 = (my minimum port in the range)
> oob_tcp_port_range_v4 = (my port range)
> btl_tcp_port_min_v4 = (my minimum port in the range)
> btl_tcp_port_range_v4 = (my port range)
>
> correct?
>
> Also, for a cluster of around 32-64 processes (8 processors per node), how wide a range will I require? I've noticed some entries in the mailing list suggesting you need a few to get started and then more are opened as necessary. Will I be safe with 20, or should I go for 100?
>
> Thanks again for all of your help!
>
> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
>
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Tuesday, July 6, 2010, 5:31 PM
>
> The problem isn't with ssh - the problem is that the daemons need to open a TCP connection back to the machine where mpirun is running. If the firewall blocks that connection, then we can't run.
>
> If you can get a range of ports opened, then you can specify the ports OMPI should use for this purpose. If the sysadmin won't allow even that, then you are pretty well hosed.
>
> On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
>
>> Yes, there is a system firewall. I don't think the sysadmin will allow it to be disabled. Each Linux machine has the built-in RHEL firewall, though SSH is allowed through it.
>>
>> --- On Tue, 7/6/10, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Date: Tuesday, July 6, 2010, 4:19 PM
>>
>> It looks like the remote daemon is starting - is there a firewall in the way?
>>
>> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
>>
>>> Hello all,
>>>
>>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD Opterons, and right now I am just working on getting OpenMPI itself up and running. I have a successful configure and make all install. The LD_LIBRARY_PATH and PATH variables were correctly edited.
>>> mpirun -np 8 hello_c successfully works on all machines. I have set up my two test machines with DSA key pairs that successfully work with each other.
>>>
>>> The problem comes when I use my hostfile to attempt to communicate across machines. The hostfile is set up correctly with <host_name> <slots> <max-slots>. When running with all verbose options enabled, "mpirun --mca plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 hello_c", I receive the following output:
>>>
>>> [machine1:03578] mca: base: components_open: Looking for plm components
>>> [machine1:03578] mca: base: components_open: opening plm components
>>> [machine1:03578] mca: base: components_open: found loaded component rsh
>>> [machine1:03578] mca: base: components_open: component rsh has no register function
>>> [machine1:03578] mca: base: components_open: component rsh open function successful
>>> [machine1:03578] mca: base: components_open: found loaded component slurm
>>> [machine1:03578] mca: base: components_open: component slurm has no register function
>>> [machine1:03578] mca: base: components_open: component slurm open function successful
>>> [machine1:03578] mca:base:select: Auto-selecting plm components
>>> [machine1:03578] mca:base:select:( plm) Querying component [rsh]
>>> [machine1:03578] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>> [machine1:03578] mca:base:select:( plm) Querying component [slurm]
>>> [machine1:03578] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>> [machine1:03578] mca:base:select:( plm) Selected component [rsh]
>>> [machine1:03578] mca: base: close: component slurm closed
>>> [machine1:03578] mca: base: close: unloading component slurm
>>> [machine1:03578] mca: base: components_open: Looking for oob components
>>> [machine1:03578] mca: base: components_open: opening oob components
>>> [machine1:03578] mca: base: components_open: found loaded component tcp
>>> [machine1:03578] mca: base: components_open: component tcp has no register function
>>> [machine1:03578] mca: base: components_open: component tcp open function successful
>>> Daemon was launched on machine2 - beginning to initialize
>>> [machine2:01962] mca: base: components_open: Looking for oob components
>>> [machine2:01962] mca: base: components_open: opening oob components
>>> [machine2:01962] mca: base: components_open: found loaded component tcp
>>> [machine2:01962] mca: base: components_open: component tcp has no register function
>>> [machine2:01962] mca: base: components_open: component tcp open function successful
>>> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
>>> Daemon [[1418,0],1] not using static ports
>>>
>>> At this point the system hangs indefinitely. While running top in the machine2 terminal, I see several things come up briefly: sshd (root), tcsh (myuser), orted (myuser), and mcstransd (root). I was wondering if sshd needs to be initiated by myuser? That is currently turned off in sshd_config through UsePAM yes. This was set up by the sysadmin, but it can be worked around if necessary.
>>>
>>> So in summary: mpirun works on each machine individually, but hangs when initiated through a hostfile or with the -host flag. ./configure was run with defaults plus --prefix, and LD_LIBRARY_PATH and PATH are set up correctly. Any help is appreciated.
>>> Thanks!
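For reference, here is a minimal sketch of the openmpi-mca-params.conf entries and firewall opening discussed above. The values simply reuse the 60001-60016 range mentioned in the thread, the iptables line is only an assumption about how such a range might be opened on a RHEL node, and the exact meaning of the *_range_v4 parameters (the width of the range, in ports) is worth confirming with "ompi_info --param oob tcp" and "ompi_info --param btl tcp" on your own install:

    # <prefix>/etc/openmpi-mca-params.conf (or $HOME/.openmpi/mca-params.conf)
    oob_tcp_port_min_v4   = 60001
    oob_tcp_port_range_v4 = 16
    btl_tcp_port_min_v4   = 60001
    btl_tcp_port_range_v4 = 16

    # Assumed firewall rule, run on every node, so daemons can connect back to mpirun:
    iptables -A INPUT -p tcp --dport 60001:60016 -j ACCEPT

Note that the params file only needs to be readable; it is parsed as text, not executed, so 644 is sufficient and 755 is not required.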