Pak Lui wrote:
Prakash,
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)
I encountered a similar problem with OpenPBS before, which also uses the
TM interfaces. It returned TM_ENOTCONNECTED (17002) when I tried to
call tm_init a second time (tm_init in turn calls tm_poll, which
returned that errno).
I think what you did was call tm_init from another node and connect to
another MOM, which I do not think is allowed. The TM module in Open MPI
has already called tm_init once. I am curious to know why you need to
call tm_init again.
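
As a rough illustration of the second-call behavior described above,
here is a minimal sketch in C. It does not link against the real TM
library: mock_tm_init is a hypothetical stand-in that only mimics the
observed symptom (success once, then 17002), and tm_init_once is a
hypothetical guard, not part of OpenPBS, Torque, or Open MPI. It does
not model the separate case of calling tm_init from a different MOM.

```c
#include <stdio.h>

/* Return codes as quoted in this thread: 0 is success,
 * 17002 is TM_ENOTCONNECTED. */
#define TM_SUCCESS       0
#define TM_ENOTCONNECTED 17002

/* Hypothetical stand-in for the real tm_init(): it succeeds on the
 * first call and reports TM_ENOTCONNECTED on any later call, which is
 * the behavior observed above (an assumption, not the library code). */
int mock_tm_initialized = 0;

int mock_tm_init(void)
{
    if (mock_tm_initialized) {
        return TM_ENOTCONNECTED;
    }
    mock_tm_initialized = 1;
    return TM_SUCCESS;
}

/* Hypothetical guard: remember a successful initialization so that
 * later callers in the same process reuse it instead of calling
 * tm_init again. */
int tm_guard_done = 0;

int tm_init_once(void)
{
    int ret;

    if (tm_guard_done) {
        return TM_SUCCESS;      /* already connected, nothing to do */
    }
    ret = mock_tm_init();
    if (ret == TM_SUCCESS) {
        tm_guard_done = 1;
    } else {
        fprintf(stderr, "ret is %d instead of 0: tm_init failed\n", ret);
    }
    return ret;
}
```

With this guard, repeated calls to tm_init_once within one process keep
returning 0, while a direct second call to mock_tm_init returns 17002.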
If you are curious to know about the implementation for PBS, you can
download the source from openpbs.org. OpenPBS source:
v2.3.16/src/lib/Libifl/tm.c
I am interested in getting this to work because I am implementing
support for dynamic scheduling in Torque. I want any node in an MPI-2
job (basically the Open MPI implementation) to be able to request more
nodes from the Torque/PBS server. I am doing a small study on that
right now. Instead of the nodes talking directly to the server, I want
them to talk to Mother Superior (MS), and MS in turn will talk to the
server.
Could you please explain why this does not work now? And why does it
work when I call tm_init from MS, but not from any other MOM?
Thanks,
Prakash