Thanks,
I've tried padb first to get stack traces. This is from IMB-MPI1
hanging after one hour; the last output was:
# Benchmarking Alltoall
# #processes = 1024
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.09         0.05
            1         1000       253.40       335.35       293.06
            2         1000       266.93       346.65       306.23
            4         1000       303.52       382.41       342.21
            8         1000       383.89       493.56       439.34
           16         1000       501.27       627.84       569.80
           32         1000      1039.65      1259.70      1163.12
           64         1000      1710.12      2071.47      1910.62
          128         1000      3051.68      3653.44      3398.65
On Fri, Dec 1, 2017 at 4:23 PM, Gilles Gouaillardet
<[email protected]> wrote:
> FWIW,
>
> pstack <pid>
> is a gdb wrapper that displays the stack trace.
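>
> For a quick look at every MPI task on a node, something along these
> lines should work (a minimal sketch, assuming the benchmark binary is
> named IMB-MPI1):
>
> for pid in `pgrep IMB-MPI1`; do
>     echo "=== $pid ==="
>     pstack $pid
> done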
>
> PADB (http://padb.pittman.org.uk) is a great open-source tool that
> automatically collects the stack traces of all the MPI tasks (and can
> do some grouping, similar to dshbak).
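>
> A typical invocation to collect and merge the traces of a running job
> would be something like this (a sketch; check padb's manual for the
> exact options on your version):
>
> padb --all --stack-trace --tree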
>
> Cheers,
>
> Gilles
>
>
> Noam Bernstein <[email protected]> wrote:
>
> On Dec 1, 2017, at 8:10 AM, Götz Waschk <[email protected]> wrote:
>
> On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk <[email protected]> wrote:
>
> I have attached my slurm job script; it simply runs mpirun
> IMB-MPI1 with 1024 processes. I haven't set any MCA parameters, so,
> for instance, vader is enabled.
>
> I have tested again with
> mpirun --mca btl "^vader" IMB-MPI1
> and it made no difference.
>
>
> I’ve lost track of the earlier parts of this thread, but has anyone
> suggested logging into the nodes it’s running on, doing “gdb -p PID” for
> each of the MPI processes, and doing “where” to see where it’s hanging?
>
> I use this script (trace_all), which depends on a variable $process
> that is a grep regexp matching the MPI executable:
>
> #!/bin/sh
> # Write a gdb command file that just prints a backtrace.
> echo "where" > /tmp/gf
>
> # Collect the PIDs of all matching MPI processes on this node.
> pids=`ps aux | grep $process | grep -v grep | grep -v trace_all | awk '{print $2}'`
> for pid in $pids; do
>     echo $pid
>     # Recover the executable path for this PID.
>     prog=`ps auxw | grep " $pid " | grep -v grep | awk '{print $11}'`
>     gdb -x /tmp/gf -batch $prog $pid
>     echo ""
> done
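>
> So on a compute node you would run it with something like this
> (assuming the script is saved as trace_all and made executable):
>
> process=IMB-MPI1 ./trace_all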
>
--
AL I:40: Do what thou wilt shall be the whole of the Law.
Stack trace(s) for thread: 1
-----------------
[0-1023] (1024 processes)
-----------------
main() at ?:?
IMB_init_buffers_iter() at ?:?
IMB_alltoall() at ?:?
-----------------
[0-31,35,42,118,163,235] (37 processes)
-----------------
PMPI_Barrier() at ?:?
ompi_coll_base_barrier_intra_recursivedoubling() at ?:?
ompi_request_default_wait() at ?:?
opal_progress() at ?:?
-----------------
[32-34,36-41,43-117,119-162,164-234,236-1023] (987 processes)
-----------------
PMPI_Alltoall() at ?:?
ompi_coll_base_alltoall_intra_basic_linear() at ?:?
ompi_request_default_wait_all() at ?:?
-----------------
[32-34,36-41,43-117,119-162,164-234,236-413,415-532,534-651,653-744,746-894,896-1023]
(982 processes)
-----------------
opal_progress() at ?:?
-----------------
[533] (1 processes)
-----------------
opal_progress@plt() at ?:?
Stack trace(s) for thread: 2
-----------------
[0-1023] (1024 processes)
-----------------
start_thread() at ?:?
progress_engine() at ?:?
opal_libevent2022_event_base_loop() at event.c:1630
epoll_dispatch() at epoll.c:407
epoll_wait() at ?:?
Stack trace(s) for thread: 3
-----------------
[0-1023] (1024 processes)
-----------------
start_thread() at ?:?
progress_engine() at ?:?
opal_libevent2022_event_base_loop() at event.c:1630
poll_dispatch() at poll.c:165
poll() at ?:?
_______________________________________________
users mailing list
[email protected]
https://lists.open-mpi.org/mailman/listinfo/users