I don't know about the ompi-1.0.3 snapshots, but we use ompi-1.0.2 with both torque-2.0.0p8 and torque-2.1.0p0 through the TM interface without any problems.
Are you using PBSPro?  OpenPBS?
As for your mpiexec: is that the one included with Open MPI (just a symlink to orterun), or the one from
http://www.osc.edu/~pw/mpiexec/index.php
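
A quick way to check both from a shell (just a sketch; adjust for your install prefix):

  # Open MPI's mpiexec is a symlink to orterun; OSC's mpiexec is a standalone binary
  ls -l `which mpiexec`

  # Check whether your Open MPI build includes the Torque/TM components;
  # if it does, ompi_info should list lines like "MCA pls: tm ..."
  ompi_info | grep -i tm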

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Jun 15, 2006, at 9:42 AM, Martin Schafföner wrote:

Hi,

I have been trying to set up Open MPI 1.0.3a1r10374 on our cluster and was partly successful. Partly, because the installation worked, and compiling a simple example and running it through the rsh pls also worked. However, I'm the only user who has rsh access to the nodes; all other users must go through Torque and launch MPI apps using Torque's TM subsystem. That's where my problem starts: I was not successful in launching apps through TM. The TM pls is configured okay, and I can see it making connections to the Torque mom in the mom's logfile; however, the app never gets run. Even if I only request one processor, mpiexec spawns several orted in a row. Here is my session log (I kill mpiexec with Ctrl-C because it would otherwise run forever):

schaffoe@node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm `pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name
--num_procs 2 --vpid_start 0 --nodename  --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name
--num_procs 3 --vpid_start 0 --nodename  --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name
--num_procs 4 --vpid_start 0 --nodename  --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name
--num_procs 5 --vpid_start 0 --nodename  --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name
--num_procs 6 --vpid_start 0 --nodename  --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe
schaffoe@node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
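
(For reference, a normal user job on our cluster would drive the same launch roughly like the sketch below; the resource request and program path are only illustrative, not the exact script used above.)

  #!/bin/sh
  #PBS -l nodes=1:ppn=1
  #PBS -l walltime=00:10:00
  # Run from the submission directory; mpiexec should pick up the Torque
  # environment and use the tm pls to start the processes.
  cd $PBS_O_WORKDIR
  mpiexec -np 1 --mca pls tm ./openmpitest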


I read in the README that the TM pls is working, whereas the LaTeX user's guide says that only rsh and bproc are supported. I am confused...

Can anybody shed some light on this?

Regards,
--
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users