[Forgot the attachment.]

On Thu, Dec 22, 2011 at 15:16, Jed Brown <j...@59a2.org> wrote:

> I wrote a new communication layer that we are evaluating for use in mesh
> management and PDE solvers, but it is based on MPI-2 one-sided operations
> (and will eventually benefit from some of the MPI-3 one-sided proposals,
> especially MPI_Fetch_and_op() and dynamic windows). All the basic
> functionality works well with MPICH2, but I have run into some Open MPI
> bugs regarding one-sided operations with composite data types. This email
> provides a reduced test case for two such bugs. I see that there are also
> some existing serious-looking bug reports regarding one-sided operations,
> but they are getting pretty old now and haven't seen action in a while.
>
> https://svn.open-mpi.org/trac/ompi/ticket/2656
> https://svn.open-mpi.org/trac/ompi/ticket/1905
>
> Is there a plan for resolving these in the near future?
>
> Is anyone using Open MPI for serious work with one-sided operations?
>
>
> Bugs I am reporting:
>
> *1.* If an MPI_Datatype has been used in an operation on an MPI_Win, I
> get an invalid free when MPI_Type_free() is called before MPI_Win_free(),
> even though the window operation has already completed. Since
> MPI_Type_free() is only supposed to mark the datatype for deletion, the
> implementation should properly manage reference counting. If you run the
> attached code with
>
> $ mpiexec -n 2 ./a.out 1
>
> (which performs only part of the communication described for the second
> bug, below), you can see the invalid free on rank 1 with the stack still
> in MPI_Win_fence():
>
> (gdb) bt
> #0  0x00007ffff7288905 in raise () from /lib/libc.so.6
> #1  0x00007ffff7289d7b in abort () from /lib/libc.so.6
> #2  0x00007ffff72c147e in __libc_message () from /lib/libc.so.6
> #3  0x00007ffff72c7396 in malloc_printerr () from /lib/libc.so.6
> #4  0x00007ffff72cb26c in free () from /lib/libc.so.6
> #5  0x00007ffff7a5aaa8 in ompi_datatype_release_args (pData=0x845010) at
> ompi_datatype_args.c:414
> #6  0x00007ffff7a5b0ea in __ompi_datatype_release (datatype=0x845010) at
> ompi_datatype_create.c:47
> #7  0x00007ffff218e772 in opal_obj_run_destructors (object=0x845010) at
> ../../../../opal/class/opal_object.h:448
> #8  ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at
> osc_rdma_replyreq.h:136
> #9  ompi_osc_rdma_replyreq_send_cb (btl=0x7ffff3680ce0,
> endpoint=<optimized out>, descriptor=0x837b00, status=<optimized out>) at
> osc_rdma_data_move.c:691
> #10 0x00007ffff347f38f in mca_btl_sm_component_progress () at
> btl_sm_component.c:645
> #11 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
> #12 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>,
> c=0x842ee0) at ../../../../opal/threads/condition.h:99
> #13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at
> osc_rdma_sync.c:207
> #14 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at
> pwin_fence.c:60
> #15 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>
> Meanwhile, rank 0 has already freed the datatype and is waiting in
> MPI_Win_free():
> (gdb) bt
> #0  0x00007ffff7312107 in sched_yield () from /lib/libc.so.6
> #1  0x00007ffff7b1f82b in opal_progress () at runtime/opal_progress.c:220
> #2  0x00007ffff7a53fe4 in opal_condition_wait (m=<optimized out>,
> c=<optimized out>) at ../opal/threads/condition.h:99
> #3  ompi_request_default_wait_all (count=2, requests=0x7fffffffd210,
> statuses=0x7fffffffd1e0) at request/req_wait.c:263
> #4  0x00007ffff25b8d71 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0,
> scount=0, sdatatype=0x7ffff7dba840, dest=1, stag=-16, recvbuf=<optimized
> out>, rcount=0, rdatatype=0x7ffff7dba840, source=1, rtag=-16,
> comm=0x8431a0, status=0x0) at coll_tuned_util.c:54
> #5  0x00007ffff25c2de2 in ompi_coll_tuned_barrier_intra_two_procs
> (comm=<optimized out>, module=<optimized out>) at coll_tuned_barrier.c:256
> #6  0x00007ffff25b92ab in ompi_coll_tuned_barrier_intra_dec_fixed
> (comm=0x8431a0, module=0x844980) at coll_tuned_decision_fixed.c:190
> #7  0x00007ffff2186248 in ompi_osc_rdma_module_free (win=0x842170) at
> osc_rdma.c:46
> #8  0x00007ffff7a58a44 in ompi_win_free (win=0x842170) at win/win.c:150
> #9  0x00007ffff7a8a0dd in PMPI_Win_free (win=0x7fffffffd408) at
> pwin_free.c:56
> #10 0x0000000000401195 in main (argc=2, argv=0x7fffffffd508) at win.c:69
>
>
> *2.* This appears to be more fundamental and perhaps much harder to fix.
> The attached code sets up the following graph
>
> rank 0:
> 0 -> (1,0)
> 1 -> nothing
> 2 -> (1,1)
>
> rank 1:
> 0 -> (0,0)
> 1 -> (0,2)
> 2 -> (0,1)
>
> We pull over this graph using two calls to MPI_Get(), each with composite
> data types defining what to pull into the first two slots and what to
> pull into the third slot. The test is Valgrind-clean with MPICH2 and
> produces the following:
>
> $ mpiexec.hydra -n 2 ./a.out 2
> [0] provided [100,101,102]  got [200, -2,201]
> [1] provided [200,201,202]  got [100,102,101]
>
> With Open MPI, I see
>
> a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr)
> (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct
> malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >=
> (unsigned long)((((__builtin_offsetof (struct malloc_chunk,
> fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) -
> 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) ==
> 0)' failed.
>
> on both ranks, with rank 0 at
>
> (gdb) bt
> #0  0x00007ffff7288905 in raise () from /lib/libc.so.6
> #1  0x00007ffff7289d7b in abort () from /lib/libc.so.6
> #2  0x00007ffff72c675d in __malloc_assert () from /lib/libc.so.6
> #3  0x00007ffff72c96d3 in _int_malloc () from /lib/libc.so.6
> #4  0x00007ffff72cad5d in malloc () from /lib/libc.so.6
> #5  0x00007ffff7b46c46 in opal_free_list_grow (flist=0x7ffff239f150,
> num_elements=1) at class/opal_free_list.c:93
> #6  0x00007ffff2196152 in ompi_osc_rdma_replyreq_alloc
> (replyreq=0x7fffffffd0f8, origin_rank=1, module=0x842d10) at
> osc_rdma_replyreq.h:82
> #7  ompi_osc_rdma_replyreq_alloc_init (module=0x842d10, origin=1,
> origin_request=..., target_displacement=0, target_count=1,
> datatype=0x8455b0, replyreq=0x7fffffffd0f8) at osc_rdma_replyreq.c:40
> #8  0x00007ffff218c051 in component_fragment_cb (btl=0x7ffff3680ce0,
> tag=<optimized out>, descriptor=<optimized out>, cbdata=<optimized out>) at
> osc_rdma_component.c:633
> #9  0x00007ffff347f25f in mca_btl_sm_component_progress () at
> btl_sm_component.c:623
> #10 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
> #11 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>,
> c=0x842de0) at ../../../../opal/threads/condition.h:99
> #12 ompi_osc_rdma_module_fence (assert=0, win=0x842170) at
> osc_rdma_sync.c:207
> #13 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842170) at
> pwin_fence.c:60
> #14 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>
> and rank 1 at
>
> (gdb) bt
> #0  0x00007ffff7288905 in raise () from /lib/libc.so.6
> #1  0x00007ffff7289d7b in abort () from /lib/libc.so.6
> #2  0x00007ffff72c675d in __malloc_assert () from /lib/libc.so.6
> #3  0x00007ffff72c96d3 in _int_malloc () from /lib/libc.so.6
> #4  0x00007ffff72cad5d in malloc () from /lib/libc.so.6
> #5  0x00007ffff7a5b3ce in opal_obj_new (cls=0x7ffff7db2060) at
> ../../opal/class/opal_object.h:469
> #6  opal_obj_new_debug (line=71, file=0x7ffff7b60323
> "ompi_datatype_create.c", type=0x7ffff7db2060) at
> ../../opal/class/opal_object.h:251
> #7  ompi_datatype_create (expectedSize=3) at ompi_datatype_create.c:71
> #8  0x00007ffff7a5b7e9 in ompi_datatype_create_indexed_block (count=1,
> bLength=1, pDisp=0x7fffee18a834, oldType=0x7ffff7db3640,
> newType=0x7fffffffd070) at ompi_datatype_create_indexed.c:124
> #9  0x00007ffff7a5a349 in __ompi_datatype_create_from_args (type=9,
> d=0x844f40, a=0x7fffee18a828, i=0x7fffee18a82c) at ompi_datatype_args.c:691
> #10 __ompi_datatype_create_from_packed_description
> (packed_buffer=0x7fffffffd108, remote_processor=0x652b90) at
> ompi_datatype_args.c:626
> #11 0x00007ffff7a5b045 in ompi_datatype_create_from_packed_description
> (packed_buffer=<optimized out>, remote_processor=<optimized out>) at
> ompi_datatype_args.c:779
> #12 0x00007ffff218bf60 in ompi_osc_base_datatype_create
> (payload=0x7fffffffd108, remote_proc=<optimized out>) at
> ../../../../ompi/mca/osc/base/osc_base_obj_convert.h:52
> #13 component_fragment_cb (btl=0x7ffff3680ce0, tag=<optimized out>,
> descriptor=<optimized out>, cbdata=<optimized out>) at
> osc_rdma_component.c:624
> #14 0x00007ffff347f25f in mca_btl_sm_component_progress () at
> btl_sm_component.c:623
> #15 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
> #16 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>,
> c=0x842ee0) at ../../../../opal/threads/condition.h:99
> #17 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at
> osc_rdma_sync.c:207
> #18 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at
> pwin_fence.c:60
> #19 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>
> This looks like memory corruption, but Open MPI internals are too noisy
> under Valgrind for it to be obvious where to look. This is with Open MPI
> 1.5.4, but I observed the same thing with trunk. If I run with three
> processes, the graph is slightly different and only ranks 1 and 2 hit the
> assertion (rank 0 hangs).
>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  MPI_Win win;
  int bugnumber,tslots[2][2],oslots[2][2],ranks[2];
  int i,rank,size,*target,*origin,n[2];
  MPI_Datatype ttype[2],otype[2];

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&size);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  if (size < 2 || argc != 2) {
    if (!rank) fprintf(stderr,"usage: mpiexec -n 2 %s BUG_NUMBER\n",argv[0]);
    MPI_Finalize();
    return 1;
  }
  bugnumber = atoi(argv[1]);

  /* Each rank pulls from two peers */
  ranks[0] = rank ? 0 : size-1;
  ranks[1] = (rank+1)%size;

  /* Target slots to read from each peer */
  n[0] = 0;
  tslots[0][n[0]++] = 0;
  if (rank) tslots[0][n[0]++] = 2;
  n[1] = 0;
  tslots[1][n[1]++] = 1;

  /* Origin slots to receive into */
  oslots[0][0] = 0;
  oslots[0][1] = 1;
  oslots[1][0] = 2;

  for (i=0; i<2; i++) {
    MPI_Type_create_indexed_block(n[i],1,tslots[i],MPI_INT,&ttype[i]);
    MPI_Type_create_indexed_block(n[i],1,oslots[i],MPI_INT,&otype[i]);
    MPI_Type_commit(&ttype[i]);
    MPI_Type_commit(&otype[i]);
  }
  MPI_Alloc_mem(3*sizeof(*target),MPI_INFO_NULL,&target);
  for (i=0; i<3; i++) target[i] = 100*(rank+1) + i;
  origin = malloc(3*sizeof(*origin));
  for (i=0; i<3; i++) origin[i] = -i-1;

  MPI_Win_create(target,(MPI_Aint)3*sizeof(*target),sizeof(*target),MPI_INFO_NULL,MPI_COMM_WORLD,&win);
  MPI_Win_fence(0,win);
  switch (bugnumber) {
  case 1:                       /* OMPI: This operation succeeds, but MPI_Win_free() fails below */
    for (i=0; i<1; i++) MPI_Get(origin,1,otype[i],ranks[i],0,1,ttype[i],win);
    break;
  case 2:                       /* OMPI: Failure in MPI_Win_fence() on rank != 1 */
    for (i=0; i<2; i++) MPI_Get(origin,1,otype[i],ranks[i],0,1,ttype[i],win);
    break;
  default:
    if (!rank) fprintf(stderr,"Unknown bugnumber %d\n",bugnumber);
    MPI_Finalize();
    return 1;
  }
  MPI_Win_fence(0,win);

  printf("[%d] provided [%3d,%3d,%3d]  got [%3d,%3d,%3d]\n",rank,target[0],target[1],target[2],origin[0],origin[1],origin[2]);

  free(origin);
  for (i=0; i<2; i++) {
    MPI_Type_free(&otype[i]);
    MPI_Type_free(&ttype[i]);
  }
  MPI_Win_free(&win);
  MPI_Free_mem(target);
  MPI_Finalize();
  return 0;
}
