We have DDT, but we do not have enough licenses to attach to the number of cores
these jobs run on.

I tried padb, but it fails.

Example:

ssh to the root node of the running MPI job:
/tmp/padb -Q -a

[nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
[nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
[nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded.  Can not communicate with peer
einner: --------------------------------------------------------------------------
einner: orterun was unable to launch the specified application as it could not access
einner: or execute an executable:
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Unexpected exit from parallel command (state=connecting)
Bad exit code from parallel command (exit_code=131)



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Sep 1, 2010, at 5:32 PM, Ashley Pittman wrote:

> 
> On 1 Sep 2010, at 21:13, Brock Palen wrote:
> 
>> I have a code for a user (namd, if anyone cares) that locks up on a
>> specific case. A quick ltrace shows the processes doing Iprobes over and
>> over, which makes me think that a process someplace is blocking on
>> communication.
>> 
>> What is the best way to look at the message queues, to see which process is
>> stuck and to drill into it?
> 
> The only three programs I know of that can do this are TotalView, DDT and Padb.
> TotalView and DDT are graphical parallel debuggers and are commercial
> products; Padb is a command-line tool and is open-source.
> 
> Ashley (padb developer)
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 

