Prakash,

tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I encountered a similar problem with OpenPBS before, which also uses the TM interface. It returned TM_ENOTCONNECTED (17002) when I tried to call tm_init a second time (tm_init in turn calls tm_poll, which returned that errno).

I think what you did was call tm_init from another node and connect to another mom, which I do not think is allowed. The TM module in Open MPI has already called tm_init once. Why do you need to call tm_init again?
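
As a rough illustration of the "call tm_init only once" point, here is a minimal sketch in C of a guard that performs the initialization a single time per process and hands back the cached result afterwards. Everything prefixed with fake_ is a hypothetical stand-in so the sketch compiles without PBS: fake_tm_init only mimics the behavior described in this thread (success on the first call, 17002 on the second) and is not the real library code, and struct fake_tm_roots is a placeholder for the real struct tm_roots from tm.h. The wrapper name tm_init_once is also made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as seen in this thread (0 on success, 17002 on failure). */
#define TM_SUCCESS        0
#define TM_ENOTCONNECTED  17002

/* Hypothetical placeholder for struct tm_roots; the real one is in tm.h. */
struct fake_tm_roots { int nnodes; };

static int fake_init_calls = 0;  /* how many times the stand-in was called */

/* Hypothetical stand-in for tm_init(): succeeds once, then fails the way
 * described in this thread. This is NOT the real implementation. */
static int fake_tm_init(void *info, struct fake_tm_roots *roots)
{
    (void)info;
    if (fake_init_calls++ > 0)
        return TM_ENOTCONNECTED;
    roots->nnodes = 1;
    return TM_SUCCESS;
}

static int tm_ready = 0;                  /* set after the first success */
static struct fake_tm_roots cached_roots; /* result of the first call */

/* Initialize at most once per process; later calls reuse the cached roots
 * instead of calling the init routine again. */
int tm_init_once(struct fake_tm_roots *roots)
{
    assert(roots != NULL);
    if (!tm_ready) {
        int rc = fake_tm_init(NULL, &cached_roots);
        if (rc != TM_SUCCESS)
            return rc;   /* leave tm_ready unset so the caller sees the error */
        tm_ready = 1;
    }
    *roots = cached_roots;
    return TM_SUCCESS;
}
```

Note that a guard like this only protects calls made within your own process. It does not help if the second tm_init comes from a different node, or from a process where Open MPI's TM module has already done the initialization.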

If you are curious about the PBS implementation, you can download the source from openpbs.org. The relevant file in the OpenPBS source is v2.3.16/src/lib/Libifl/tm.c

--

Thanks,

- Pak Lui
pak....@sun.com
