My Intel IMB-MPI tests stall, but only in very specific cases: larger
packet sizes plus large core counts. It only happens for the bcast, gather,
and exchange tests, and only at the larger core counts (~256 cores). Other
tests like pingpong and sendrecv run fine even with larger core
counts.
e.g., this bcast test …
Yes, that is correct - we reserve the first port in the range for a daemon,
should one exist.
The problem is clearly that get_node_rank is returning the wrong value for
the second process (your rank=1). If you want to dig deeper, look at the
orte/mca/ess/generic code where it generates the nidmap.
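For what it's worth, here is a minimal sketch of what a node-rank lookup over a nidmap-style table conceptually does: a process's node rank is just its position among the processes assigned to the same node. This is only an illustration of the idea, not the actual ess/generic implementation; the table layout and the names (proc_node, get_node_rank_from_map) are assumptions.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint16_t node_rank_t;
    #define NODE_RANK_INVALID UINT16_MAX

    /* hypothetical flattened nidmap: proc_node[i] is the index of the node
     * hosting MPI rank i */
    static node_rank_t get_node_rank_from_map(const int *proc_node,
                                              size_t nprocs, int rank)
    {
        if (rank < 0 || (size_t)rank >= nprocs) {
            return NODE_RANK_INVALID;
        }
        node_rank_t nrank = 0;
        /* count how many lower-numbered ranks share this rank's node */
        for (int r = 0; r < rank; r++) {
            if (proc_node[r] == proc_node[rank]) {
                nrank++;
            }
        }
        return nrank;
    }

If rank 1 lands on the same node as rank 0 but the map says otherwise, a lookup along these lines would return 0 instead of 1, which would match the symptom described above.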
Ralph,
somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this comment:
    orte_node_rank_t nrank;
    /* do I know my node_local_rank yet? */
    if (ORTE_NODE_RANK_INVALID != (nrank = orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) &&
        (nrank+
Something doesn't look right - here is what the algo attempts to do:
given a port range of 10000-12000, the lowest-ranked process on the node
should open port 10000. The next lowest rank on the node will open 10001,
etc.
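In other words, the port each local process opens is the base of the range plus its node rank, shifted up by one slot when a daemon has already claimed the first port (as noted earlier in this thread). A minimal sketch of that selection under those assumptions; pick_static_port, port_base, range_size and have_daemon are illustrative names, not the actual oob/tcp code:

    /* hypothetical helper: choose a static port from the contiguous range
     * [port_base, port_base + range_size) based on this process's node rank.
     * If a daemon exists on the node it takes port_base, so local MPI
     * processes are shifted up by one slot. */
    static int pick_static_port(int port_base, int range_size,
                                int node_rank, int have_daemon)
    {
        int offset = node_rank + (have_daemon ? 1 : 0);
        if (offset >= range_size) {
            return -1;   /* more local processes than available ports */
        }
        return port_base + offset;
    }

With that logic, two processes that both believe they have node rank 0 would try to open the same port, which is exactly the kind of collision a wrong get_node_rank value would produce.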
So it looks to me like there is some confusion in the local rank algo. I'll
Ralph,
I'm able to use the generic module when the processes are on different machines.
What would the values of the EVs (environment variables) be when two processes are on the same
machine (hopefully talking over SHM)?
I've played with combinations of nodelist and ppn but no luck. I get errors like:
[c0301b10e1:0317
I am pleased to announce that Open MPI now supports checkpoint/restart process
migration and automatic recovery. This is in addition to our current support
for more traditional checkpoint/restart fault tolerance. These new features
were introduced in the Open MPI development trunk in commit r235