Thanks, Jordan. I recognize that this is very recent feature work and my goal is to help push it forward.
My current use case is relatively straightforward, though there are a couple of layers on top of HDF5 itself. The problem can be reproduced by building PETSc 3.8.1 against libraries built from the develop branch of HDF5, adding the H5Pset_filter() calls, and running an example that exercises them. (I'm using src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to induce HDF5 writes.) If you want, I can supply full details for you to reproduce it locally, or I can run any experiments you'd like me to within this setup.

(It also involves patches to the out-of-tree H5Z plugins to make them use H5MM_malloc/H5MM_xfree rather than raw malloc/free, which in turn involves exposing H5MMprivate.h to the plugins. Is this something you've solved in a different way?)
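For concreteness, the setup I'm describing boils down to something like the standalone sketch below (this is not the PETSc code path; the file name, dataset name, chunk sizes, and the built-in deflate filter are placeholders for the ZFP/SZIP plugins I'm actually exercising, and it assumes the row count divides evenly among the ranks):

/* mpicc sketch.c -lhdf5 ; run with mpiexec */
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* open the file through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* chunked, filtered dataset creation property list */
    hsize_t dims[2]  = {128, 128};
    hsize_t chunk[2] = {16, 16};
    unsigned level   = 6;
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_filter(dcpl, H5Z_FILTER_DEFLATE, H5Z_FLAG_OPTIONAL, 1, &level);

    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "dset", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* each rank selects a contiguous block of rows (assumes nprocs divides dims[0]) */
    hsize_t start[2] = {(hsize_t)rank * (dims[0] / nprocs), 0};
    hsize_t count[2] = {dims[0] / nprocs, dims[1]};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    double *buf = malloc(count[0] * count[1] * sizeof(double));
    for (size_t i = 0; i < count[0] * count[1]; i++)
        buf[i] = (double)rank;

    /* collective transfer, which is what routes the write into H5Dmpio.c */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Sclose(fspace);
    H5Pclose(dcpl); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

And this is the kind of change I mean for the plugins (again just a sketch; my_plugin_filter stands in for the real ZFP/SZIP filter callbacks, and the only point is the allocator substitution, which is what forces me to expose the non-public H5MMprivate.h header to the plugin build):

/* out-of-tree H5Z plugin callback with raw malloc/free swapped for the
 * library's H5MM routines; in the real plugin this is registered through
 * its H5Z_class2_t, omitted here */
#include <string.h>
#include "H5MMprivate.h"   /* H5MM_malloc(), H5MM_xfree() -- not a public header */

static size_t
my_plugin_filter(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                 size_t nbytes, size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values;

    size_t out_size = nbytes;              /* placeholder: a real filter computes this */
    void *outbuf = H5MM_malloc(out_size);  /* was: malloc(out_size) */
    if (!outbuf)
        return 0;                          /* 0 tells the pipeline the filter failed */

    /* ... the real (de)compression from *buf into outbuf goes here ... */
    memcpy(outbuf, *buf, nbytes);          /* identity transform as a stand-in */

    H5MM_xfree(*buf);                      /* was: free(*buf) */
    *buf      = outbuf;
    *buf_size = out_size;
    return out_size;
}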
On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson <jhender...@hdfgroup.org> wrote:
> Hi Michael,
>
> during the design phase of this feature I tried to both account for and test the case where some of the writers do not have any data to contribute. However, it seems like your use case falls outside of what I have tested (perhaps I have not used enough ranks?). In particular my test cases were small and simply had some of the ranks call H5Sselect_none(), which doesn't seem to trigger this particular assertion failure. Is this how you're approaching these particular ranks in your code or is there a different way you are having them participate in the write operation?
>
> As for the hanging issue, it looks as though rank 0 is waiting to receive some modification data from another rank for a particular chunk. Whether or not there is actually valid data that rank 0 should be waiting for, I cannot easily tell without being able to trace it through. As the other ranks have finished modifying their particular sets of chunks, they have moved on and are waiting for everyone to get together and broadcast their new chunk sizes so that free space in the file can be collectively re-allocated, but of course rank 0 is not proceeding forward. My best guess is that either:
>
> The "num_writers" field for the chunk struct corresponding to the particular chunk that rank 0 is working on has been incorrectly set, causing rank 0 to think that there are more ranks writing to the chunk than the actual amount and consequently causing rank 0 to wait forever for a non-existent MPI message
>
> or
>
> The "new_owner" field of the chunk struct for this chunk was incorrectly set on the other ranks, causing them to never issue an MPI_Isend to rank 0, also causing rank 0 to wait for a non-existent MPI message
>
> This feature should still be regarded as being in beta and its complexity can lead to difficult to track down bugs such as the ones you are currently encountering. That being said, your feedback is very useful and will help to push this feature towards a production-ready level of quality. Also, if it is feasible to come up with a minimal example that reproduces this issue, it would be very helpful and would make it much easier to diagnose why exactly these failures are occurring.
>
> Thanks,
> Jordan
>
> ________________________________
> From: Hdf-forum <hdf-forum-boun...@lists.hdfgroup.org> on behalf of Michael K. Edwards <m.k.edwa...@gmail.com>
> Sent: Wednesday, November 8, 2017 11:23 AM
> To: Miller, Mark C.
> Cc: HDF Users Discussion List
> Subject: Re: [Hdf-forum] Collective IO and filters
>
> Closer to 1000 ranks initially. There's a bug in handling the case where some of the writers don't have any data to contribute (because there's a dimension smaller than the number of ranks), which I have worked around like this:
>
> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
> index af6599a..9522478 100644
> --- a/src/H5Dchunk.c
> +++ b/src/H5Dchunk.c
> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
>          /* Indicate that the chunk's memory space is shared */
>          chunk_info->mspace_shared = TRUE;
>      } /* end if */
> +    else if(H5SL_count(fm->sel_chunks)==0) {
> +        /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
> +    } /* end else if */
>      else {
>          /* Get bounding box for file selection */
>          if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)
>
> That makes the assert go away. Now I'm investigating a hang in the chunk redistribution logic in rank 0, with a backtrace that looks like this:
>
> #0  0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
> #1  0x00007f4bd5d3b341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #2  0x00007f4bd5d3012d in MPID_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #3  0x00007f4bd5cbeeb4 in PMPI_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #4  0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, local_chunk_array=0x17f0f80, local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
> #5  0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00) at H5Dmpio.c:2794
> #6  0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, dx_plist=0x16f7230) at H5Dmpio.c:1447
> #7  0x00007f4bd81a027d in H5D__chunk_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at H5Dmpio.c:933
> #8  0x00007f4bd81a0968 in H5D__chunk_collective_write (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104, file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at H5Dmpio.c:1018
> #9  0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:835
> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010, direct_write=false, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:394
> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:318
>
> The other ranks have moved past this and are hanging here:
>
> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
> #1  0x00007feb6fe25341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #2  0x00007feb6fdd8975 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv (local_array=0x125f2d0, local_array_num_entries=0, array_entry_size=368, _gathered_array=0x7ffff083f1d8, _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4, allgather=true, root=0, comm=-1006632952, sort_func=0x0) at H5Dmpio.c:479
> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280, dx_plist=0x11cf240) at H5Dmpio.c:1479
> #10 0x00007feb7228a27d in H5D__chunk_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at H5Dmpio.c:933
> #11 0x00007feb7228a968 in H5D__chunk_collective_write (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74, file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at H5Dmpio.c:1018
> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:835
> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0, direct_write=false, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:394
> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:318
>
> (I'm currently running with this patch atop commit bf570b1, on an earlier theory that the crashing bug may have crept in after Jordan's big merge. I'll rebase on current develop but I doubt that'll change much.)
>
> The hang may or may not be directly related to the workaround being a bit of a hack. I can set you up with full reproduction details if you like; I seem to be getting some traction on it, but more eyeballs are always good, especially if they're better set up for MPI tracing than I am right now.
>
> On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <mille...@llnl.gov> wrote:
>> Hi Michael,
>>
>> I have not tried this in parallel yet. That said, what scale are you trying to do this at? 1000 ranks or 1,000,000 ranks? Something in between?
>>
>> My understanding is that there are some known scaling issues out past maybe 10,000 ranks. Not heard of outright assertion failures there though.
>>
>> Mark
>>
>> "Hdf-forum on behalf of Michael K. Edwards" wrote:
>>
>> I'm trying to write an HDF5 file with dataset compression from an MPI job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.) After running into the "Parallel I/O does not support filters yet" error message in release versions of HDF5, I have turned to the develop branch. Clearly there has been much work towards collective filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly it is not quite ready for prime time yet. So far I've encountered a livelock scenario with ZFP, reproduced it with SZIP, and, with no filters at all, obtained this nifty error message:
>>
>> ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion `fm->m_ndims==fm->f_ndims' failed.
>>
>> Has anyone on this list been able to write parallel HDF5 using a recent state of the develop branch, with or without filters configured?
>>
>> Thanks,
>> - Michael

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5