We have DDT, but our licenses do not cover the number of cores these jobs run on.
I tried padb, but it fails. Example:

ssh to root node for running MPI job:

/tmp/padb -Q -a
[nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
[nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
[nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
[nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
einner: --------------------------------------------------------------------------
einner: orterun was unable to launch the specified application as it could not access
einner: or execute an executable:
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Unexpected exit from parallel command (state=connecting)
Bad exit code from parallel command (exit_code=131)

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Sep 1, 2010, at 5:32 PM, Ashley Pittman wrote:

>
> On 1 Sep 2010, at 21:13, Brock Palen wrote:
>
>> I have a code for a user (namd if anyone cares) that on a specific case
>> will lock up, a quick ltrace shows the processes doing Iprobes over and
>> over, so this makes me think that a process someplace is blocking on
>> communication.
>>
>> What is the best way to look at message queues? To see what process is stuck
>> and to drill into.
>
> The only three programs I know which can do this are TotalView, DDT and Padb.
> Totalview and DDT are graphical parallel debuggers and are commercial
> projects, Padb is a command-line tool and is open-source
>
> Ashley (padb developer)
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk