In the last few years, it has been very simple to
set up multiple high-performance (GbE) back-to-back
connections between three nodes (triangular topology)
or four nodes (tetrahedral topology).

All you had to do was
- use 3 (or 4) cheap compute nodes w/Linux and connect
  each of them via a standard GbE router (onboard GbE NIC)
  to a file server,
- put 2 (triangular topol.) or 3 (tetrahedral topol.)
  $25 PCIe-GbE-NICs into *each* node,
- connect the nodes with 3 (triangular) or 6 (tetrahedral)
  short crossover Cat5e cables,
- configure the extra NICs into different subnets
  according to their "edge index", e.g.
  for 3 nodes (node10, node11, node12):
    node10
      onboard NIC: 192.168.0.10 on eth0 (to router/server)
      extra NIC: 10.0.1.10 on eth1 (edge 1 to 10.0.1.11)
      extra NIC: 10.0.2.10 on eth2 (edge 2 to 10.0.2.12)
    node11
      onboard NIC: 192.168.0.11 on eth0 (to router/server)
      extra NIC: 10.0.1.11 on eth1 (edge 1 to 10.0.1.10)
      extra NIC: 10.0.3.11 on eth3 (edge 3 to 10.0.3.12)
    node12
      onboard NIC: 192.168.0.12 on eth0 (to router/server)
      extra NIC: 10.0.2.12 on eth2 (edge 2 to 10.0.2.10)
      extra NIC: 10.0.3.12 on eth3 (edge 3 to 10.0.3.11)
- that's it. I mean, that *was* it, with 1.2.x.
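
To make the addressing scheme above explicit, here is a minimal Python sketch that derives the extra-NIC addresses from the edge indices. It assumes that edge k lives in subnet 10.0.k.0/24 and that the host part equals the node number; the /24 netmask, the EDGES table and the address_plan() helper are illustrative assumptions, not part of the original setup.

```python
import ipaddress

# Assumption: edge k uses subnet 10.0.k.0/24 and the host part of each
# address equals the node number (node10 -> .10, node11 -> .11, ...).
EDGES = {1: (10, 11), 2: (10, 12), 3: (11, 12)}


def address_plan(edges):
    """Return {node: [(edge, own_address, peer_address), ...]}."""
    plan = {}
    for edge, (a, b) in sorted(edges.items()):
        net = ipaddress.ip_network(f"10.0.{edge}.0/24")
        addr_a = str(net.network_address + a)
        addr_b = str(net.network_address + b)
        # Each edge contributes one NIC on each of its two end nodes.
        plan.setdefault(a, []).append((edge, addr_a, addr_b))
        plan.setdefault(b, []).append((edge, addr_b, addr_a))
    return plan


if __name__ == "__main__":
    for node, nics in sorted(address_plan(EDGES).items()):
        print(f"node{node}")
        for edge, addr, peer in nics:
            print(f"  extra NIC: {addr} (edge {edge} to {peer})")
```

For the tetrahedral case the same helper would take six edges between four nodes, so every node ends up with three extra NICs.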

OMPI 1.2.x would then ingeniously discover the routable edges
and open communication ports accordingly without any additional
explicit host routing, e.g. when invoked with

$> mpirun -np 12 --host c10,c11,c12 --mca btl_tcp_if_exclude lo,eth0  my_mpi_app

and (measured with iftop) saturate the available edges with
about 100 MB/sec duplex on each of them. It would not stumble
over the fact that some interfaces are not directly reachable
from every NIC. And this was very convenient over the years.

With 1.4.3 (which ships out of the box with current Linux distributions),
this no longer works. It hangs and, after a timeout, complains about
failed endpoint connects, e.g.:

[node12][[52378,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
 connect() to 10.0.1.11 failed: Connection timed out (110)
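
The address in that message belongs to edge 1, on which node12 has no NIC. The following Python sketch only illustrates that observation; it says nothing about how Open MPI selects addresses internally. The /24 netmasks, the NODE_IFACES table and the directly_reachable() helper are assumptions made for the example, and eth0 is left out because it is excluded on the mpirun command line.

```python
import ipaddress

# Assumption: /24 netmasks. Addresses are the extra NICs listed above.
NODE_IFACES = {
    "node10": ["10.0.1.10/24", "10.0.2.10/24"],
    "node11": ["10.0.1.11/24", "10.0.3.11/24"],
    "node12": ["10.0.2.12/24", "10.0.3.12/24"],
}


def directly_reachable(src, dst_addr, ifaces=NODE_IFACES):
    """True if dst_addr lies in a subnet that src has an interface on."""
    dst = ipaddress.ip_address(dst_addr)
    return any(dst in ipaddress.ip_interface(i).network for i in ifaces[src])


if __name__ == "__main__":
    for src in sorted(NODE_IFACES):
        for peer, peer_ifaces in sorted(NODE_IFACES.items()):
            if peer == src:
                continue
            for iface in peer_ifaces:
                addr = str(ipaddress.ip_interface(iface).ip)
                ok = directly_reachable(src, addr)
                status = "direct" if ok else "no direct link"
                print(f"{src} -> {peer} ({addr}): {status}")
```

Under these assumptions, each node can reach exactly one of the two extra addresses of every peer without additional host routes; 10.0.1.11 as seen from node12 is one of the addresses it cannot reach directly.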

* Can the intelligent behaviour of 1.2.x be "configured back"?

* What should the topology look like to work with 1.4.x painlessly?

Thanks & regards

M.

