[OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-19 Thread Rahul Nabar
My Intel IMB-MPI tests stall, but only in very specific cases: larger packet sizes + large core counts. Only happens for bcast, gather and exchange tests. Only for the larger core counts (~256 cores). Other tests like pingpong and sendrecv run fine even with larger core counts. e.g. This bcast tes

Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-08-19 Thread Ralph Castain
Yes, that is correct - we reserve the first port in the range for a daemon, should one exist. The problem is clearly that get_node_rank is returning the wrong value for the second process (your rank=1). If you want to dig deeper, look at the orte/mca/ess/generic code where it generates the nidmap

Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-08-19 Thread Philippe
Ralph, somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this comment: orte_node_rank_t nrank; /* do I know my node_local_rank yet? */ if (ORTE_NODE_RANK_INVALID != (nrank = orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) && (nrank+

Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-08-19 Thread Ralph Castain
Something doesn't look right - here is what the algo attempts to do: given a port range of 10000-12000, the lowest ranked process on the node should open port 10000. The next lowest rank on the node will open 10001, etc. So it looks to me like there is some confusion in the local rank algo. I'll

Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-08-19 Thread Philippe
Ralph, I'm able to use the generic module when the processes are on different machines. What would be the values of the EV when two processes are on the same machine (hopefully talking over SHM)? I've played with combinations of nodelist and ppn but no luck. I get errors like: [c0301b10e1:0317

[OMPI users] Checkpoint/Restart Process Migration and Automatic Recovery Support

2010-08-19 Thread Joshua Hursey
I am pleased to announce that Open MPI now supports checkpoint/restart process migration and automatic recovery. This is in addition to our current support for more traditional checkpoint/restart fault tolerance. These new features were introduced in the Open MPI development trunk in commit r235