More precisely, I create the VMs through the OpenStack web interface. I have been assigned some resources by the administrator, and I created the VM instances, each with 2 VCPUs, using the OpenStack dashboard. So I do not know whether the VMs are placed on the same or on different physical nodes.
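Since the instances may have landed on different compute hosts and can have more than one network attached, it might be worth confirming which interface carries the internal addresses and pinning Open MPI to it. The lines below are only a sketch of that kind of check; the interface name eth0 is an assumption and should be replaced with whatever the VMs actually report:

> ip addr    # on each VM: note which interface holds the internal IP (eth0 assumed below)
> mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -host fehg_node_0,fehg_node_1 hostname    # restrict OOB and TCP BTL traffic to that interface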
FYI: testing with the following command on fehg_node_0 gives me this output:

> mpirun --mca plm_base_verbose 20 --host fehg_node_1 hostname

[fehg-node-0:02057] mca: base: components_open: Looking for plm components
[fehg-node-0:02057] mca: base: components_open: opening plm components
[fehg-node-0:02057] mca: base: components_open: found loaded component rsh
[fehg-node-0:02057] mca: base: components_open: component rsh has no register function
[fehg-node-0:02057] mca: base: components_open: component rsh open function successful
[fehg-node-0:02057] mca: base: components_open: found loaded component slurm
[fehg-node-0:02057] mca: base: components_open: component slurm has no register function
[fehg-node-0:02057] mca: base: components_open: component slurm open function successful
[fehg-node-0:02057] mca:base:select: Auto-selecting plm components
[fehg-node-0:02057] mca:base:select:( plm) Querying component [rsh]
[fehg-node-0:02057] mca:base:select:( plm) Query of component [rsh] set priority to 10
[fehg-node-0:02057] mca:base:select:( plm) Querying component [slurm]
[fehg-node-0:02057] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[fehg-node-0:02057] mca:base:select:( plm) Selected component [rsh]
[fehg-node-0:02057] mca: base: close: component slurm closed
[fehg-node-0:02057] mca: base: close: unloading component slurm
[fehg-node-7:02660] mca: base: components_open: Looking for plm components
[fehg-node-7:02660] mca: base: components_open: opening plm components
[fehg-node-7:02660] mca: base: components_open: found loaded component rsh
[fehg-node-7:02660] mca: base: components_open: component rsh has no register function
[fehg-node-7:02660] mca: base: components_open: component rsh open function successful
[fehg-node-7:02660] mca:base:select: Auto-selecting plm components
[fehg-node-7:02660] mca:base:select:( plm) Querying component [rsh]
[fehg-node-7:02660] mca:base:select:( plm) Query of component [rsh] set priority to 10
[fehg-node-7:02660] mca:base:select:( plm) Selected component [rsh]

and it freezes here.

Regards,
Karos

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: 28 March 2015 18:23
To: Open MPI Users
Subject: Re: [OMPI users] Connection problem on Linux cluster

Just to be clear: do you have two physical nodes? Or just one physical node and you are running two VMs on it?

On Mar 28, 2015, at 10:51 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:

I have a floating IP for accessing the nodes from outside the cluster, plus internal IP addresses. I tried to run the jobs with both of them (both IP addresses), but it makes no difference. I have also just installed Open MPI 1.6.5 to see how this version works; in that case I get nothing and have to press Ctrl+C, and no output or error is shown.

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: 28 March 2015 17:03
To: Open MPI Users
Subject: Re: [OMPI users] Connection problem on Linux cluster

You mentioned running this in a VM - is that IP address correct for getting across the VMs?
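(One quick way to check exactly that from fehg-node-0 would be something along these lines; 10.104.5.40 is the remote address reported in the error quoted below, and the commands are only illustrative:)

> ping -c 3 10.104.5.40      # is the other VM reachable from fehg-node-0 at all?
> ip route get 10.104.5.40   # which local interface and route would be used to reach it?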
On Mar 28, 2015, at 8:38 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:

Hi,

I am wondering how I can solve this problem.

System Spec:
1- Linux cluster with two nodes (master and slave) running Ubuntu 12.04 LTS 32-bit.
2- Open MPI 1.8.4.

I do a simple test running on fehg_node_0:

> mpirun -host fehg_node_0,fehg_node_1 hello_world -mca oob_base_verbose 20

and I get the following error:

A process or daemon was unable to complete a TCP connection to another process:

  Local host:  fehg-node-0
  Remote host: 10.104.5.40

This is usually caused by a firewall on the remote host. Please check that any firewall (e.g., iptables) has been disabled and try again.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons. This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).

Some more details:
1- I have full access to the VMs on the cluster and set everything up myself.
2- The firewall and iptables are disabled on both nodes.
3- The nodes can ssh to each other with no problem.
4- Non-interactive bash calls work fine, i.e. when I run ssh othernode env | grep PATH from both nodes, both PATH and LD_LIBRARY_PATH are set correctly.
5- I have checked earlier posts; a similar problem was reported for Solaris, but I could not find a clue about mine.
6- Running with --enable-orterun-prefix-by-default does not make any difference.
7- I can see orted running on the other node when I check the processes, but nothing happens after that and the error above appears.

Regards,
Karos
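For completeness, the firewall claim in point 2 and the error's hint about a missing route back to mpirun can be double-checked with something like the sketch below; the hostfile name hosts.txt is made up for illustration, and sudo access on the VMs is assumed:

> sudo iptables -L -n                       # on each node: confirm there are no DROP/REJECT rules left
> ssh fehg_node_1 'sudo iptables -L -n'     # same check on the remote node, run non-interactively

A hostfile can also make the intended placement explicit (two slots per VM, matching the 2 VCPUs):

  fehg_node_0 slots=2
  fehg_node_1 slots=2

> mpirun --hostfile hosts.txt -np 4 hello_world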