Gleb / George --

Is there an easy way for us to put a cap on max recursion down in opal_progress? Just put a counter in opal_progress() such that if it exceeds some max value, return success without doing anything (if opal_progress_event_flag indicates that nothing *needs* to be done)?

This wouldn't *guarantee* that we cap recursion (because we sometimes need progress before returning), but it might help for the cases described in this mail (we're eagerly progressing when we don't necessarily need to)...?
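
Something along these lines is what I'm picturing -- just a sketch, not the real code; the counter name, the cap value, and the simplified signature are all invented here (and threading is ignored), and the real opal_progress() obviously does a lot more than this:

----------- snip-------------
/* Illustrative sketch of a recursion cap in opal_progress(); the names
 * "progress_depth" / "PROGRESS_MAX_DEPTH" and the cap value are made up,
 * and the real function signature/body are simplified. */
static int progress_depth = 0;
#define PROGRESS_MAX_DEPTH 32          /* arbitrary cap */

void opal_progress(void)
{
    /* If we're already deep in recursive progress calls and nothing
     * *needs* to be progressed (cf. opal_progress_event_flag), return
     * immediately -- i.e., report "success" without doing any work. */
    if (progress_depth >= PROGRESS_MAX_DEPTH) {
        return;
    }

    ++progress_depth;
    /* ... run the existing event loop / registered progress callbacks ... */
    --progress_depth;
}
----------- snip-------------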


On Feb 22, 2008, at 9:48 AM, John Markus Bjørndalen wrote:

Hi,

I ran into a bug when running a few microbenchmarks for OpenMPI. I had
thrown in Reduce and Gather for sanity checking, but OpenMPI crashed
when running those operations. Usually, this would happen when I reached
around 12-16 nodes.

My current crash-test code looks like this (I've removed a few lines
that were commented out):

----------- snip-------------
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
   int rank, size, count = 1;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (argc > 1)
        count = atoi(argv[1]);

   int n, i;
   // Just make sure we have plenty of buffer for any operation
   int *sbuf = malloc(sizeof(int) * 2 * count);
   int *rbuf = malloc(sizeof(int) * 2 * count);
   assert(sbuf);
   assert(rbuf);

   for (n = 1; n <= 10000; n += 100) {
       printf("N = %d\n", n);
       fflush(stdout);
        for (i = 0; i < n; i++) {
            MPI_Reduce(sbuf, rbuf, count, MPI_INT, MPI_SUM, 0,
                       MPI_COMM_WORLD);
        }
       MPI_Barrier(MPI_COMM_WORLD);
       printf(" -- DONE\n");
       fflush(stdout);
       MPI_Barrier(MPI_COMM_WORLD);
   }
   MPI_Finalize();
   return 0;
}
----------- snip-------------


I can usually trigger a crash with count=1 and n=1000 using 16+ nodes, but
I can also trigger it with 44 nodes and larger packets (around 32k ints,
I think). I can also crash it on a single host using 19 processes, but
then it usually doesn't crash until I reach somewhere between 1200 and
3000 iterations. Gather seems to have the same problems as Reduce.
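
For reference, the single-host case boils down to a plain launch like the
one below (an illustrative command line, not copied from my scripts; count
is argv[1] of the test program):

----------- snip-------------
# Single host, 19 ranks, count = 1 (the "19 processes" case above).
mpirun -np 19 ./ompi-crash2 1
----------- snip-------------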

The output from running gdb on the coredump looks like this:

----------- snip-------------
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `./ompi-crash2'.
Program terminated with signal 11, Segmentation fault.
#0  0x00434184 in sysconf () from /lib/tls/libc.so.6
#0  0x00434184 in sysconf () from /lib/tls/libc.so.6
#1  0xb7e78b59 in _int_malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#2  0xb7e799ce in malloc () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#3  0xb7f04852 in ompi_free_list_grow () from /home/johnm/local/ompi/lib/libmpi.so.0
#4  0xb7d74e70 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
#5  0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#6  0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#7  0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#8  0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
#9  0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#10 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#11 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#12 0xb7d74f08 in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
#13 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#14 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0

... and then continues until...

#1356848 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356849 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
#1356850 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
#1356851 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356852 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356853 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356854 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
#1356855 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
#1356856 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356857 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356858 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356859 0xb7d8f389 in mca_pml_ob1_recv_frag_match () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
#1356860 0xb7d74a7d in mca_btl_tcp_endpoint_recv_handler () from /home/johnm/local/ompi/lib/openmpi/mca_btl_tcp.so
#1356861 0xb7e62b44 in opal_event_base_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356862 0xb7e62cff in opal_event_loop () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356863 0xb7e5d284 in opal_progress () from /home/johnm/local/ompi/lib/libopen-pal.so.0
#1356864 0xb7d8cb69 in mca_pml_ob1_recv () from /home/johnm/local/ompi/lib/openmpi/mca_pml_ob1.so
#1356865 0xb7d5bb1c in ompi_coll_tuned_reduce_intra_basic_linear () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
#1356866 0xb7d55913 in ompi_coll_tuned_reduce_intra_dec_fixed () from /home/johnm/local/ompi/lib/openmpi/mca_coll_tuned.so
#1356867 0xb7f3db6c in PMPI_Reduce () from /home/johnm/local/ompi/lib/libmpi.so.0
#1356868 0x0804899e in main (argc=1, argv=0xbfba8a84) at ompi-crash2.c:58
----------- snip-------------

I poked around in the code, and it looks like the culprit might be in
the macros that try to allocate fragments in
mca_pml_ob1_recv_frag_match: MCA_PML_OB1_RECV_FRAG_ALLOC and
MCA_PML_OB1_RECV_FRAG_INIT use OMPI_FREE_LIST_WAIT, which in turn can
end up calling opal_condition_wait(). opal_condition_wait() calls
opal_progress() to "block", which looks like it leads to infinite
recursion in this case.

I guess the problem is a race condition when one node is hammered with
incoming packets.
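
To make the suspected cycle concrete, here is a small stand-alone toy
program (my own sketch, NOT Open MPI code) where a "recv handler" that
finds the free list empty drives the progress loop again, which fires the
handler again, and so on:

----------- snip-------------
/* Toy model of the suspected cycle -- not Open MPI code.  The stand-ins
 * roughly correspond to:
 *   recv_handler()  ~ mca_btl_tcp_endpoint_recv_handler()
 *   progress()      ~ opal_progress() / opal_event_loop()
 *   free_fragments  ~ the ob1 receive-fragment free list
 */
#include <stdio.h>

static long pending_events = 2000000;  /* lots of queued incoming data */
static int  free_fragments = 0;        /* the fragment free list is empty */
static long depth = 0, max_depth = 0;  /* how deep the recursion gets */

static void progress(void);

/* Needs a fragment, finds none, and "blocks" OMPI_FREE_LIST_WAIT-style
 * by driving the progress engine again instead of returning. */
static void recv_handler(void)
{
    if (free_fragments == 0) {
        progress();
    }
}

/* More data is pending on the socket, so the recv handler fires again:
 * mutual recursion that never unwinds while data keeps arriving. */
static void progress(void)
{
    if (++depth > max_depth) {
        max_depth = depth;
    }
    if (pending_events > 0) {
        --pending_events;
        recv_handler();
    }
    --depth;
}

int main(void)
{
    progress();
    /* With default stack limits and an event queue this deep, the process
     * dies of stack exhaustion before getting here -- much like the
     * ~1.35-million-frame core dump above. */
    printf("max recursion depth: %ld\n", max_depth);
    return 0;
}
----------- snip-------------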

The stack trace contains about 1.35 million frames, so I won't include
all of it here, but here are some statistics to verify that not much else
is happening in that stack (I can make the full trace available if
anybody needs it):

----------- snip-------------
Number of callframes:  1356870
Called function statistics (how often in stackdump):
 PMPI_Reduce                                        1
 _int_malloc                                        1
 main                                               1
 malloc                                             1
 mca_btl_tcp_endpoint_recv_handler             339197
 mca_pml_ob1_recv                                   1
 mca_pml_ob1_recv_frag_match                       72
 ompi_coll_tuned_reduce_intra_basic_linear           1
 ompi_coll_tuned_reduce_intra_dec_fixed             1
 ompi_free_list_grow                                1
 opal_event_base_loop                          339197
 opal_event_loop                               339197
 opal_progress                                 339197
 sysconf                                            2
Address statistics (how often in stackdump), plus functions with that addr (sanity check):
 0x00434184                                         2 set(['sysconf'])
 0x0804899e                                         1 set(['main'])
 0xb7d55913                                         1 set(['ompi_coll_tuned_reduce_intra_dec_fixed'])
 0xb7d5bb1c                                         1 set(['ompi_coll_tuned_reduce_intra_basic_linear'])
 0xb7d74a7d                                        72 set(['mca_btl_tcp_endpoint_recv_handler'])
 0xb7d74e70                                         1 set(['mca_btl_tcp_endpoint_recv_handler'])
 0xb7d74f08                                    339124 set(['mca_btl_tcp_endpoint_recv_handler'])
 0xb7d8cb69                                         1 set(['mca_pml_ob1_recv'])
 0xb7d8f389                                        72 set(['mca_pml_ob1_recv_frag_match'])
 0xb7e5d284                                    339197 set(['opal_progress'])
 0xb7e62b44                                    339197 set(['opal_event_base_loop'])
 0xb7e62cff                                    339197 set(['opal_event_loop'])
 0xb7e78b59                                         1 set(['_int_malloc'])
 0xb7e799ce                                         1 set(['malloc'])
 0xb7f04852                                         1 set(['ompi_free_list_grow'])
 0xb7f3db6c                                         1 set(['PMPI_Reduce'])
----------- snip-------------

I don't have any suggestions for a fix though, since this is the first
time I've looked into the OpenMPI code.

Btw., in case it makes a difference for triggering the bug: I'm running
this on a cluster with 1 frontend and 44 nodes. The cluster runs Rocks
4.1, and each node is a 3.2GHz P4 Prescott machine with 2GB RAM,
connected with gigabit Ethernet.


Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/



--
Jeff Squyres
Cisco Systems

