On 2 Sep 2010, at 15:56, Brock Palen wrote:

> Ashly still having trouble using padb with openmpi/1.4.2
>
> [dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate with peer
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in
> file util/comm/comm.c at line 62
> [nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in
> file orte-ps.c at line 799
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate with peer
> No active jobs could be found for user 'dianawon'
>
> The job is running, I get this error running just orte-ps,

If orte-ps isn't running correctly then there is very little padb can do. If that is the case, try using the "mpirun" resource manager interface rather than "orte". This will cause padb to use the MPIR interface and try to get the information directly from the mpirun process before launching itself via pdsh. It doesn't scale as well as the orte integration (pdsh eventually runs out of file descriptors), but it is more generic and might get you to somewhere that works. If your job spans more than 32 nodes you may need to set the FANOUT variable for pdsh to work.

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk