On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be the same MPI rank. That definitely won't work.

It's the "local rank", if that makes any difference...

Any thoughts on this output?

[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
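For anyone collecting several of these messages, a small parser may make it easier to see which process reported the problem and which identifier it received. This is only a sketch: the regular expression is inferred from the single line above, the helper name parse_unexpected_peer and the field labels (job family, job, rank) are my own, and other Open MPI versions may format the message differently.

```python
import re

# Pattern inferred from the one log line quoted above (an assumption, not an
# official format). It tolerates optional whitespace before the line number.
LINE_RE = re.compile(
    r"\[(?P<host>[^\]]+)\]"
    r"\[\[(?P<fam>\d+),(?P<job>\d+)\],(?P<rank>\d+)\]"
    r"\[(?P<file>[^:\]]+):\s*(?P<line>\d+):(?P<func>\w+)\]\s+"
    r"received unexpected process identifier\s+"
    r"\[\[(?P<peer_fam>\d+),(?P<peer_job>\d+)\],(?P<peer_rank>\d+)\]"
)


def parse_unexpected_peer(log_line):
    """Return host, reporting process and received identifier, or None."""
    m = LINE_RE.search(log_line)
    if m is None:
        return None
    return {
        "host": m.group("host"),
        # Process that printed the message: (job family, job, rank).
        "reporter": (int(m.group("fam")), int(m.group("job")), int(m.group("rank"))),
        # Identifier that arrived on the connection.
        "received": (
            int(m.group("peer_fam")),
            int(m.group("peer_job")),
            int(m.group("peer_rank")),
        ),
        "source": (m.group("file"), int(m.group("line")), m.group("func")),
    }


SAMPLE = (
    "[xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 486:"
    "mca_btl_tcp_endpoint_recv_connect_ack] received unexpected "
    "process identifier [[61029,1],3]"
)

parsed = parse_unexpected_peer(SAMPLE)
print(parsed["host"], parsed["reporter"], parsed["received"])
```

Read this way, the line says that the process named [[61029,1],4] on xserve03.local received the identifier [[61029,1],3] during the TCP connection handshake, which was not the peer it expected on that connection.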

The question is why that is happening. We use Torque all the time, so we know that the basic support is correct. It -could- be related to library confusion, but I can't tell for sure.

Just to be clear, this is not going through Torque at this point. It's just vanilla ssh, for which this code worked with 1.1.5.


Can you rebuild OMPI with --enable-debug and rerun the job with the following added to your command line?

-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
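If the job is launched from a script, the debug options above can be spliced into the mpirun command programmatically. The sketch below only assembles and prints the command. The application path ./my_app and the process count of 4 are placeholders I made up, and the helper name build_mpirun_command is hypothetical; only the debug flags come from the request above.

```python
import shlex

# Debug options exactly as requested above.
DEBUG_FLAGS = [
    "-mca", "plm_base_verbose", "5",
    "--debug-daemons",
    "-mca", "odls_base_verbose", "5",
]


def build_mpirun_command(app_args, num_procs):
    """Return the mpirun argument list with the debug flags inserted."""
    return ["mpirun", "-np", str(num_procs), *DEBUG_FLAGS, *app_args]


# Placeholder application and process count (assumptions for illustration).
cmd = build_mpirun_command(["./my_app"], 4)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) once the placeholders are replaced with the real application and process count.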

Working on this...

Thanks,  Jody
