Hello Everyone,

I've set up a cluster of four nodes, each with 128 cores. It's a full
mesh network: every node has a direct connection to every other node, in
my case over 25 Gbps fiber with TCP/IP. I can run an MPI "hello world"
that touches all 512 cores across the four nodes, and I can launch that
same job from each node successfully. However, I'm ultimately trying to
run HPL 2.3, and it only works if I run on 128 cores, regardless of
which nodes they run on. I can do:

mpirun -n 128 -host node1:128,node2:128,node3:128,node4:128 xhpl

and that produces what looks like normal HPL output.

If I set -n to anything above 128, I get an error indicating that one
node tried to reach another over a network it has no route to.

If I set -n to 512 (which seems to me like it should be the optimal
number), I get two such errors about impossible routes being attempted.

I can follow up with the exact error messages; it's just difficult to
get them from my phone.

To provide background: each physical link between a pair of nodes is its
own "network", so the IP tables get pretty interesting, but I can
ssh/scp from every node to every other node and, as I mentioned, run an
MPI hello world from each node successfully, so I'm inclined to think my
network configuration is correct.

I have a hosts file in /etc/ on each node reflecting the correct IP to
reach each of the other nodes from that node's perspective, so the files
are all different, but they all appear correct.
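
To illustrate the layout (with made-up addresses; my actual subnets
differ), node1's /etc/hosts looks something like this, where each
point-to-point link is its own subnet:

# /etc/hosts on node1 (hypothetical addresses)
10.0.12.2   node2    # link node1<->node2
10.0.13.3   node3    # link node1<->node3
10.0.14.4   node4    # link node1<->node4

node2's file would instead list node1 as 10.0.12.1, node3 as 10.0.23.3,
and so on, since node2 reaches each peer over a different link.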

I'm scratching my head as to why HPL tries to reach, for example, node3
from node4 using an IP address that only node2 should know about.

Any ideas?
Thank you
