It looks like the remote daemon is starting - is there a firewall in the way?
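If a firewall is the culprit, a quick sanity check is to probe raw TCP reachability between the two nodes, since the launched orted must connect back to mpirun on ephemeral ports. A minimal sketch (the `check_port` helper, host name, and port numbers below are placeholders for illustration, not anything Open MPI provides; the MCA port-range parameters in the comment are the ones listed in the Open MPI FAQ — verify they exist in your build with `ompi_info --param oob tcp`):

```shell
#!/bin/bash
# check_port HOST PORT: return 0 if a TCP connection to HOST:PORT opens.
# Uses bash's /dev/tcp pseudo-device; runs in a subshell so fd 3 is
# closed automatically on exit. Hypothetical helper for firewall probing.
check_port() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Probe a port on the remote node from the head node (names are placeholders).
if check_port machine2 1024; then
    echo "machine2:1024 reachable"
else
    echo "machine2:1024 blocked or closed"
fi

# If the firewall only opens a fixed range, Open MPI can be pinned to it,
# assuming your 1.4.2 build supports these parameters:
#   mpirun --mca oob_tcp_port_min_v4 10000 --mca oob_tcp_port_range_v4 100 \
#          --mca btl_tcp_port_min_v4 10100 --mca btl_tcp_port_range_v4 100 \
#          -hostfile hostfile -np 16 hello_c
```

Run the probe in both directions (machine1 → machine2 and machine2 → machine1); the daemon's connection back to mpirun is the one that typically hangs exactly where your log stops.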

On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:

> Hello all,
> 
> I am using Open MPI 1.4.2 on RHEL. I have a cluster of AMD Opterons, and right 
> now I am just working on getting Open MPI itself up and running. configure and 
> make all install complete successfully, and the LD_LIBRARY_PATH and PATH 
> variables were edited correctly. mpirun -np 8 hello_c works on all 
> machines. I have set up my two test machines with DSA key pairs that 
> authenticate to each other successfully.
> 
> The problem comes when I use my hostfile to try to run across machines. 
> The hostfile is set up correctly as <host_name> <slots> <max-slots>. When 
> running with all verbose options enabled, "mpirun --mca 
> plm_base_verbose 99 --debug-daemons --mca btl_base_verbose 30 --mca 
> oob_base_verbose 99 --mca pml_base_verbose 99 -hostfile hostfile -np 16 
> hello_c", I get the following output.
> 
> [machine1:03578] mca: base: components_open: Looking for plm components
> [machine1:03578] mca: base: components_open: opening plm components
> [machine1:03578] mca: base: components_open: found loaded component rsh
> [machine1:03578] mca: base: components_open: component rsh has no register 
> function
> [machine1:03578] mca: base: components_open: component rsh open function 
> successful
> [machine1:03578] mca: base: components_open: found loaded component slurm
> [machine1:03578] mca: base: components_open: component slurm has no register 
> function
> [machine1:03578] mca: base: components_open: component slurm open function 
> successful
> [machine1:03578] mca:base:select: Auto-selecting plm components
> [machine1:03578] mca:base:select:(  plm) Querying component [rsh]
> [machine1:03578] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [machine1:03578] mca:base:select:(  plm) Querying component [slurm]
> [machine1:03578] mca:base:select:(  plm) Skipping component [slurm]. Query 
> failed to return a module
> [machine1:03578] mca:base:select:(  plm) Selected component [rsh]
> [machine1:03578] mca: base: close: component slurm closed
> [machine1:03578] mca: base: close: unloading component slurm
> [machine1:03578] mca: base: components_open: Looking for oob components
> [machine1:03578] mca: base: components_open: opening oob components
> [machine1:03578] mca: base: components_open: found loaded component tcp
> [machine1:03578] mca: base: components_open: component tcp has no register 
> function
> [machine1:03578] mca: base: components_open: component tcp open function 
> successful
> Daemon was launched on machine2- beginning to initialize
> [machine2:01962] mca: base: components_open: Looking for oob components
> [machine2:01962] mca: base: components_open: opening oob components
> [machine2:01962] mca: base: components_open: found loaded component tcp
> [machine2:01962] mca: base: components_open: component tcp has no register 
> function
> [machine2:01962] mca: base: components_open: component tcp open function 
> successful
> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
> Daemon [[1418,0],1] not using static ports
> 
> At this point the system hangs indefinitely. While running top in a 
> terminal on machine2, I briefly see several processes appear: sshd (root), 
> tcsh (myuser), orted (myuser), and mcstransd (root). Does sshd need to be 
> initiated by myuser? That is currently turned off in sshd_config through 
> UsePAM yes. This was set up by the sysadmin, but it can be worked around 
> if necessary.
> 
> So, in summary: mpirun works on each machine individually but hangs when 
> launched through a hostfile or with the -host flag. ./configure was run 
> with defaults plus --prefix, and LD_LIBRARY_PATH and PATH are set up 
> correctly. Any help is appreciated. Thanks!
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
