Oddly enough, it is not the tag that is mismatched between receiver
and senders; it is io_info->comm.  Something is decidedly out of whack
here.

Rank 0, owner 0 probing with tag 0 on comm -1006632942
Rank 2, owner 0 sent with tag 0 to comm -1006632952 as request 0
Rank 3, owner 0 sent with tag 0 to comm -1006632952 as request 0
Rank 1, owner 0 sent with tag 0 to comm -1006632952 as request 0
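
One way to pin this down, wherever both communicator variables are
visible in the same process, is MPI_Comm_compare: point-to-point
matching requires the very same communicator (same context), so
anything other than MPI_IDENT means a send on one handle will never
satisfy a probe on the other; even an MPI_CONGRUENT duplicate has a
distinct context.  A hypothetical debug helper (the names are mine,
not the H5Dmpio.c code):

    #include <stdio.h>
    #include <mpi.h>

    /* Warn if two communicator handles do not refer to the same
     * communicator object.  "where" is just a label for the printout. */
    static void
    check_comms_match(MPI_Comm probe_comm, MPI_Comm send_comm, const char *where)
    {
        int rank, result;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_compare(probe_comm, send_comm, &result);
        if (result != MPI_IDENT)
            printf("Rank %d: comm mismatch at %s (MPI_Comm_compare = %d)\n",
                   rank, where, result);
    }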


On Wed, Nov 8, 2017 at 2:51 PM, Michael K. Edwards
<m.k.edwa...@gmail.com> wrote:
>
> I see that you're re-sorting by owner using a comparator called
> H5D__cmp_filtered_collective_io_info_entry_owner(), which does not sort
> by a secondary key among items with equal owners.  That, together
> with a sort that isn't stable (which HDqsort() probably isn't on most
> platforms; quicksort/introsort is not stable), will scramble the order
> in which different ranks traverse their local chunk arrays.  That will
> cause deadly embraces between ranks that are waiting for each other's
> chunks to be sent.  To fix that, it's probably sufficient to use the
> chunk offset as a secondary sort key in that comparator.
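>
> Something along these lines, with a deterministic secondary key,
> should do it.  A minimal sketch; the struct and names here are
> illustrative, not the actual H5Dmpio.c types:
>
>     #include <stdint.h>
>
>     /* Illustrative entry type, not the real chunk-entry struct. */
>     typedef struct {
>         int      new_owner;     /* rank that will process this chunk */
>         uint64_t chunk_offset;  /* chunk's offset in the file        */
>     } chunk_entry_t;
>
>     static int
>     cmp_owner_then_offset(const void *a, const void *b)
>     {
>         const chunk_entry_t *ca = (const chunk_entry_t *)a;
>         const chunk_entry_t *cb = (const chunk_entry_t *)b;
>
>         /* Primary key: owner rank */
>         if (ca->new_owner != cb->new_owner)
>             return (ca->new_owner < cb->new_owner) ? -1 : 1;
>
>         /* Secondary key: chunk offset, so entries with equal owners
>          * come out in the same order on every rank even though
>          * qsort() is not guaranteed to be stable */
>         if (ca->chunk_offset != cb->chunk_offset)
>             return (ca->chunk_offset < cb->chunk_offset) ? -1 : 1;
>
>         return 0;
>     }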
>
> That's not the root cause of the hang I'm currently experiencing,
> though.  Still digging into that.
>
>
> On Wed, Nov 8, 2017 at 1:50 PM, Dana Robinson <derob...@hdfgroup.org> wrote:
> > Yes. All outside code that frees, allocates, or reallocates memory created
> > inside the library (or that will be passed back into the library, where it
> > could be freed or reallocated) should use these functions. This includes
> > filters.
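> >
> > For an outside filter, that would look something like the minimal
> > pass-through sketch below, assuming the routines in question are the
> > public H5allocate_memory()/H5free_memory() wrappers (the filter
> > itself is purely illustrative):
> >
> >     #include <string.h>
> >     #include "hdf5.h"
> >
> >     /* Pass-through filter: replaces the buffer with a copy that was
> >      * allocated by the library's public allocator, so HDF5 can later
> >      * resize or free it safely. */
> >     static size_t
> >     passthrough_filter(unsigned int flags, size_t cd_nelmts,
> >                        const unsigned int cd_values[], size_t nbytes,
> >                        size_t *buf_size, void **buf)
> >     {
> >         void *new_buf;
> >
> >         (void)flags;
> >         (void)cd_nelmts;
> >         (void)cd_values;
> >
> >         /* Allocate the replacement buffer with H5allocate_memory() */
> >         if (NULL == (new_buf = H5allocate_memory(nbytes, 0)))
> >             return 0;  /* filter failure */
> >
> >         memcpy(new_buf, *buf, nbytes);
> >
> >         /* Release the old buffer with the matching H5free_memory(),
> >          * never raw free(), before handing the new one back. */
> >         H5free_memory(*buf);
> >         *buf      = new_buf;
> >         *buf_size = nbytes;
> >
> >         return nbytes;
> >     }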
> >
> >
> >
> > Dana
> >
> >
> >
> > From: Jordan Henderson <jhender...@hdfgroup.org>
> > Date: Wednesday, November 8, 2017 at 13:46
> > To: Dana Robinson <derob...@hdfgroup.org>, "m.k.edwa...@gmail.com"
> > <m.k.edwa...@gmail.com>, HDF List <hdf-forum@lists.hdfgroup.org>
> > Subject: Re: [Hdf-forum] Collective IO and filters
> >
> >
> >
> > Dana,
> >
> >
> >
> > Would it then make sense for all outside filters to use these routines? Due
> > to Parallel Compression's internal nature, it uses buffers allocated via
> > H5MM_ routines to collect and scatter data, which works fine for the
> > internal filters like deflate, since they use these as well. However, since
> > some of the outside filters use the raw malloc/free routines, which causes
> > these issues, I'm wondering whether having all outside filters use the H5_
> > routines is the cleanest solution.
> >
> >
> >
> > Michael,
> >
> >
> >
> > Based on the "num_writers: 4" field, the NULL "receive_requests_array", and
> > the fact that for the same chunk, rank 0 shows "original owner: 0, new
> > owner: 0" while rank 3 shows "original owner: 3, new_owner: 0", it seems as
> > though everyone IS interested in the chunk that rank 0 is now working on, but
> > now I'm more confident that at some point either the messages may have
> > failed to send or rank 0 is having problems finding them.
> >
> >
> >
> > Since the unfiltered case won't hit this particular code path, I'm not
> > surprised that it succeeds. If I had to make another guess based on
> > this, I would be inclined to think that rank 0 must be hanging on the
> > MPI_Mprobe due to a mismatch in the "tag" field. I use the index of the
> > chunk as the tag for the message in order to funnel specific messages to the
> > correct rank for the correct chunk during the last part of the chunk
> > redistribution, and if rank 0 can't match the tag, it of course won't find
> > the message. Why this might be happening, I'm not entirely certain at the
> > moment.
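> >
> > For reference, here is a stripped-down, self-contained sketch of the
> > pairing I'm describing (variable names are illustrative, not the
> > actual H5Dmpio.c code; run with at least two ranks). The probe only
> > returns a message whose source, tag, and communicator all match what
> > the sender used, so a mismatch in any of the three will hang the
> > MPI_Mprobe:
> >
> >     #include <stdlib.h>
> >     #include <mpi.h>
> >
> >     /* Rank 1 sends a "chunk" to rank 0, tagged with the chunk index;
> >      * rank 0 probes for it on the same communicator. */
> >     int main(int argc, char **argv)
> >     {
> >         int      rank;
> >         int      chunk_index = 42;     /* chunk index used as the tag */
> >         MPI_Comm io_comm;              /* stand-in for io_info->comm  */
> >
> >         MPI_Init(&argc, &argv);
> >         MPI_Comm_dup(MPI_COMM_WORLD, &io_comm);
> >         MPI_Comm_rank(io_comm, &rank);
> >
> >         if (rank == 1) {
> >             char chunk_buf[8] = "payload";
> >
> >             MPI_Send(chunk_buf, (int)sizeof(chunk_buf), MPI_BYTE,
> >                      0, chunk_index, io_comm);
> >         }
> >         else if (rank == 0) {
> >             MPI_Message msg;
> >             MPI_Status  status;
> >             int         nbytes;
> >             char       *recv_buf;
> >
> >             /* Hangs forever if the tag or the communicator here is
> >              * not the one the sender used. */
> >             MPI_Mprobe(MPI_ANY_SOURCE, chunk_index, io_comm, &msg, &status);
> >             MPI_Get_count(&status, MPI_BYTE, &nbytes);
> >             recv_buf = (char *)malloc((size_t)nbytes);
> >             MPI_Mrecv(recv_buf, nbytes, MPI_BYTE, &msg, &status);
> >             free(recv_buf);
> >         }
> >
> >         MPI_Comm_free(&io_comm);
> >         MPI_Finalize();
> >         return 0;
> >     }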
