padb itself (it is a Perl script) needs to exist on all nodes, because it 
calls orterun on itself. Try installing it to a shared directory, or copy 
padb to /tmp on every node.
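
If you go the copy route, something like the following would do it. This is 
only a sketch: the function name copy_padb, the DRY_RUN switch and the node 
names in the usage line are made up for illustration, and it assumes you can 
scp to each compute node without a password.

```shell
# Copy the padb script to /tmp/padb on every node given on the command line.
# Usage: copy_padb /path/to/padb node1 node2 ...
# Set DRY_RUN=1 to print the scp commands instead of running them.
copy_padb() {
    src=$1
    shift
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp $src $node:/tmp/padb"
        else
            # Stop at the first node that cannot be reached.
            scp "$src" "$node:/tmp/padb" || return 1
        fi
    done
}
```

For example, "copy_padb /tmp/padb node01 node02" run from the root node of 
the job, with the node list taken from whatever your scheduler reports.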

To access the message queues, padb needs a compiled helper program, which is 
installed in $PREFIX/lib, so I would recommend rebuilding padb with its 
prefix set to an NFS-shared directory.  I can help you more with this if 
required.
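
A rough sketch of the rebuild, assuming your padb source tree has the usual 
./configure script and Makefile (check the README in your copy, this may 
differ between versions). The function name rebuild_padb, the DRY_RUN switch 
and both paths in the usage line are placeholders, not anything padb 
provides.

```shell
# Rebuild padb so that it, and its helper under $PREFIX/lib, install to a
# prefix that every node can see.
# Usage: rebuild_padb /path/to/padb-source /shared/nfs/padb
# Set DRY_RUN=1 to print the commands instead of running them.
# Assumption: the source tree provides ./configure and a Makefile.
rebuild_padb() {
    src_dir=$1
    prefix=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd $src_dir"
        echo "./configure --prefix=$prefix"
        echo "make"
        echo "make install"
        return 0
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$src_dir" && ./configure --prefix="$prefix" && make && make install )
}
```

After that, run padb from the bin directory under the shared prefix rather 
than from /tmp.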

Ashley,

On 1 Sep 2010, at 23:01, Brock Palen wrote:

> We have ddt, but we do not have licenses to attach to the number of cores 
> these jobs run at.
> 
> I tried padb,  but it fails, 
> 
> Example:
> 
> ssh to root node for running MPI job:
> /tmp/padb -Q -a
> 
> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in 
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in 
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: 
> Communication retries exceeded.  Can not communicate with peer
> einner: 
> --------------------------------------------------------------------------
> einner: orterun was unable to launch the specified application as it could 
> not access
> einner: or execute an executable:
> Unexpected EOF from Inner stdout (connecting)
> Unexpected EOF from Inner stderr (connecting)
> Unexpected exit from parallel command (state=connecting)
> Bad exit code from parallel command (exit_code=131)

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

