More precisely, I create the VMs through the OpenStack web interface. I have been assigned some resources by the administrator, and I created the VM instances, each with 2 VCPUs, using the OpenStack dashboard. So I do not know whether the VMs are placed on the same or on different physical nodes.
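Since the instances may have landed on different compute hosts and can have more than one network attached, it might be worth confirming which interface carries the internal addresses and pinning Open MPI to it. The lines below are only a sketch of that kind of check; the interface name eth0 is an assumption and should be replaced with whatever the VMs actually report:

> ip addr    # on each VM: note which interface holds the internal IP (eth0 assumed below)
> mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -host fehg_node_0,fehg_node_1 hostname    # restrict OOB and TCP BTL traffic to that interface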
FYI: testing with the following command on fehg_node_0 gives me this output:

> mpirun --mca plm_base_verbose 20 --host fehg_node_1 hostname

[fehg-node-0:02057] mca: base: components_open: Looking for plm components
[fehg-node-0:02057] mca: base: components_open: opening plm components
[fehg-node-0:02057] mca: base: components_open: found loaded component rsh
[fehg-node-0:02057] mca: base: components_open: component rsh has no register function
[fehg-node-0:02057] mca: base: components_open: component rsh open function successful
[fehg-node-0:02057] mca: base: components_open: found loaded component slurm
[fehg-node-0:02057] mca: base: components_open: component slurm has no register function
[fehg-node-0:02057] mca: base: components_open: component slurm open function successful
[fehg-node-0:02057] mca:base:select: Auto-selecting plm components
[fehg-node-0:02057] mca:base:select:( plm) Querying component [rsh]
[fehg-node-0:02057] mca:base:select:( plm) Query of component [rsh] set priority to 10
[fehg-node-0:02057] mca:base:select:( plm) Querying component [slurm]
[fehg-node-0:02057] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[fehg-node-0:02057] mca:base:select:( plm) Selected component [rsh]
[fehg-node-0:02057] mca: base: close: component slurm closed
[fehg-node-0:02057] mca: base: close: unloading component slurm
[fehg-node-7:02660] mca: base: components_open: Looking for plm components
[fehg-node-7:02660] mca: base: components_open: opening plm components
[fehg-node-7:02660] mca: base: components_open: found loaded component rsh
[fehg-node-7:02660] mca: base: components_open: component rsh has no register function
[fehg-node-7:02660] mca: base: components_open: component rsh open function successful
[fehg-node-7:02660] mca:base:select: Auto-selecting plm components
[fehg-node-7:02660] mca:base:select:( plm) Querying component [rsh]
[fehg-node-7:02660] mca:base:select:( plm) Query of component [rsh] set priority to 10
[fehg-node-7:02660] mca:base:select:( plm) Selected component [rsh]

and it freezes here.

Regards,
Karos

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: 28 March 2015 18:23
To: Open MPI Users
Subject: Re: [OMPI users] Connection problem on Linux cluster

Just to be clear: do you have two physical nodes? Or just one physical node and you are running two VMs on it?

On Mar 28, 2015, at 10:51 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:

I have a floating IP for accessing the nodes from outside the cluster, plus internal IP addresses. I tried to run the jobs with both of them (both IP addresses), but it makes no difference. I have also just installed Open MPI 1.6.5 to see how this version works; in that case I get nothing and have to press Ctrl+C, and no output or error is shown.

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: 28 March 2015 17:03
To: Open MPI Users
Subject: Re: [OMPI users] Connection problem on Linux cluster

You mentioned running this in a VM - is that IP address correct for getting across the VMs?
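(One quick way to check exactly that from fehg-node-0 would be something along these lines; 10.104.5.40 is the remote address reported in the error quoted below, and the commands are only illustrative:)

> ping -c 3 10.104.5.40      # is the other VM reachable from fehg-node-0 at all?
> ip route get 10.104.5.40   # which local interface and route would be used to reach it?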
On Mar 28, 2015, at 8:38 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:

Hi,

I am wondering how I can solve this problem.

System Spec:
1- Linux cluster with two nodes (master and slave) running Ubuntu 12.04 LTS 32-bit.
2- Open MPI 1.8.4.

I do a simple test running on fehg_node_0:

> mpirun -host fehg_node_0,fehg_node_1 hello_world -mca oob_base_verbose 20

and I get the following error:

A process or daemon was unable to complete a TCP connection to another process:

  Local host:  fehg-node-0
  Remote host: 10.104.5.40

This is usually caused by a firewall on the remote host. Please check that any firewall (e.g., iptables) has been disabled and try again.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons. This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).

Some more details:
1- I have full access to the VMs on the cluster and set everything up myself.
2- The firewall and iptables are disabled on both nodes.
3- The nodes can ssh to each other with no problem.
4- Non-interactive bash calls work fine, i.e. when I run ssh othernode env | grep PATH from both nodes, both PATH and LD_LIBRARY_PATH are set correctly.
5- I have checked earlier posts; a similar problem was reported for Solaris, but I could not find a clue about mine.
6- Running with --enable-orterun-prefix-by-default does not make any difference.
7- I can see orted running on the other node when I check the processes, but nothing happens after that and the error above appears.

Regards,
Karos
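For completeness, the firewall claim in point 2 and the error's hint about a missing route back to mpirun can be double-checked with something like the sketch below; the hostfile name hosts.txt is made up for illustration, and sudo access on the VMs is assumed:

> sudo iptables -L -n                       # on each node: confirm there are no DROP/REJECT rules left
> ssh fehg_node_1 'sudo iptables -L -n'     # same check on the remote node, run non-interactively

A hostfile can also make the intended placement explicit (two slots per VM, matching the 2 VCPUs):

  fehg_node_0 slots=2
  fehg_node_1 slots=2

> mpirun --hostfile hosts.txt -np 4 hello_world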