The padb executable (it's a Perl script) needs to exist on all nodes because it calls orterun on itself. Try installing it to a shared directory or copying padb to /tmp on every node.
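
If you go the copy-to-/tmp route, something along these lines might save some typing. It is only a sketch: the function name padb_copy_cmds and the hostfile (one node name per line) are assumptions for illustration, not anything padb provides. It prints one scp command per node rather than running anything, so you can check the list first.

```shell
#!/bin/sh
# Print (not run) one scp command per node listed in a hostfile.
# The hostfile format and the function name are assumptions for illustration.
padb_copy_cmds() {
    src=$1        # local path to the padb script
    hostfile=$2   # file with one hostname per line
    while IFS= read -r node; do
        # Skip blank lines and comment lines.
        case $node in ''|'#'*) continue ;; esac
        printf 'scp %s %s:/tmp/padb\n' "$src" "$node"
    done < "$hostfile"
}
```

Once the printed commands look right, piping the output to sh (for example `padb_copy_cmds ./padb nodes.txt | sh`) would run them.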
To access the message queues padb needs a compiled helper program, which is installed in $PREFIX/lib, so I would recommend re-building padb and giving it a prefix on an NFS shared directory. I can help you more with this if required.

Ashley,

On 1 Sep 2010, at 23:01, Brock Palen wrote:

> We have ddt, but we do not have licenses to attach to the number of cores
> these jobs run at.
>
> I tried padb, but it fails,
>
> Example:
>
> ssh to root node for running MPI job:
> /tmp/padb -Q -a
>
> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate with peer
> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate with peer
> einner:
> --------------------------------------------------------------------------
> einner: orterun was unable to launch the specified application as it could
> not access
> einner: or execute an executable:
> Unexpected EOF from Inner stdout (connecting)
> Unexpected EOF from Inner stderr (connecting)
> Unexpected exit from parallel command (state=connecting)
> Bad exit code from parallel command (exit_code=131)

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk