Thanks, Jordan. I recognize that this is very recent feature work and my goal is to help push it forward.
My current use case is relatively straightforward, though there are a couple of layers on top of HDF5 itself. The problem can be reproduced by building PETSc 3.8.1 against libraries built from the develop branch of HDF5, adding the H5Pset_filter() calls, and running an example that exercises them. (I'm using src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to induce HDF5 writes.) If you want, I can supply full details for you to reproduce it locally, or I can run any experiments you'd like me to within this setup.

(It also involves patches to the out-of-tree H5Z plugins to make them use H5MM_malloc/H5MM_xfree rather than raw malloc/free, which in turn involves exposing H5MMprivate.h to the plugins. Is this something you've solved in a different way?)
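For concreteness, the setup I'm describing boils down to something like the standalone sketch below (this is not the PETSc code path; the file name, dataset name, chunk sizes, and the built-in deflate filter are placeholders for the ZFP/SZIP plugins I'm actually exercising, and it assumes the row count divides evenly among the ranks):

/* mpicc sketch.c -lhdf5 ; run with mpiexec */
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* open the file through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* chunked, filtered dataset creation property list */
    hsize_t dims[2]  = {128, 128};
    hsize_t chunk[2] = {16, 16};
    unsigned level   = 6;
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_filter(dcpl, H5Z_FILTER_DEFLATE, H5Z_FLAG_OPTIONAL, 1, &level);

    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "dset", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* each rank selects a contiguous block of rows (assumes nprocs divides dims[0]) */
    hsize_t start[2] = {(hsize_t)rank * (dims[0] / nprocs), 0};
    hsize_t count[2] = {dims[0] / nprocs, dims[1]};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    double *buf = malloc(count[0] * count[1] * sizeof(double));
    for (size_t i = 0; i < count[0] * count[1]; i++)
        buf[i] = (double)rank;

    /* collective transfer, which is what routes the write into H5Dmpio.c */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Sclose(fspace);
    H5Pclose(dcpl); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

And this is the kind of change I mean for the plugins (again just a sketch; my_plugin_filter stands in for the real ZFP/SZIP filter callbacks, and the only point is the allocator substitution, which is what forces me to expose the non-public H5MMprivate.h header to the plugin build):

/* out-of-tree H5Z plugin callback with raw malloc/free swapped for the
 * library's H5MM routines; in the real plugin this is registered through
 * its H5Z_class2_t, omitted here */
#include <string.h>
#include "H5MMprivate.h"   /* H5MM_malloc(), H5MM_xfree() -- not a public header */

static size_t
my_plugin_filter(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                 size_t nbytes, size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values;

    size_t out_size = nbytes;              /* placeholder: a real filter computes this */
    void *outbuf = H5MM_malloc(out_size);  /* was: malloc(out_size) */
    if (!outbuf)
        return 0;                          /* 0 tells the pipeline the filter failed */

    /* ... the real (de)compression from *buf into outbuf goes here ... */
    memcpy(outbuf, *buf, nbytes);          /* identity transform as a stand-in */

    H5MM_xfree(*buf);                      /* was: free(*buf) */
    *buf      = outbuf;
    *buf_size = out_size;
    return out_size;
}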
On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson <jhender...@hdfgroup.org> wrote:
> Hi Michael,
>
> during the design phase of this feature I tried to both account for and test the case where some of the writers do not have any data to contribute. However, it seems like your use case falls outside of what I have tested (perhaps I have not used enough ranks?). In particular my test cases were small and simply had some of the ranks call H5Sselect_none(), which doesn't seem to trigger this particular assertion failure. Is this how you're approaching these particular ranks in your code or is there a different way you are having them participate in the write operation?
>
> As for the hanging issue, it looks as though rank 0 is waiting to receive some modification data from another rank for a particular chunk. Whether or not there is actually valid data that rank 0 should be waiting for, I cannot easily tell without being able to trace it through. As the other ranks have finished modifying their particular sets of chunks, they have moved on and are waiting for everyone to get together and broadcast their new chunk sizes so that free space in the file can be collectively re-allocated, but of course rank 0 is not proceeding forward. My best guess is that either:
>
> The "num_writers" field for the chunk struct corresponding to the particular chunk that rank 0 is working on has been incorrectly set, causing rank 0 to think that there are more ranks writing to the chunk than the actual amount and consequently causing rank 0 to wait forever for a non-existent MPI message
>
> or
>
> The "new_owner" field of the chunk struct for this chunk was incorrectly set on the other ranks, causing them to never issue an MPI_Isend to rank 0, also causing rank 0 to wait for a non-existent MPI message
>
> This feature should still be regarded as being in beta and its complexity can lead to difficult to track down bugs such as the ones you are currently encountering. That being said, your feedback is very useful and will help to push this feature towards a production-ready level of quality. Also, if it is feasible to come up with a minimal example that reproduces this issue, it would be very helpful and would make it much easier to diagnose why exactly these failures are occurring.
>
> Thanks,
> Jordan
>
> ________________________________
> From: Hdf-forum <hdf-forum-boun...@lists.hdfgroup.org> on behalf of Michael K. Edwards <m.k.edwa...@gmail.com>
> Sent: Wednesday, November 8, 2017 11:23 AM
> To: Miller, Mark C.
> Cc: HDF Users Discussion List
> Subject: Re: [Hdf-forum] Collective IO and filters
>
> Closer to 1000 ranks initially. There's a bug in handling the case where some of the writers don't have any data to contribute (because there's a dimension smaller than the number of ranks), which I have worked around like this:
>
> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
> index af6599a..9522478 100644
> --- a/src/H5Dchunk.c
> +++ b/src/H5Dchunk.c
> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
>          /* Indicate that the chunk's memory space is shared */
>          chunk_info->mspace_shared = TRUE;
>      } /* end if */
> +    else if(H5SL_count(fm->sel_chunks)==0) {
> +        /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
> +    } /* end else if */
>      else {
>          /* Get bounding box for file selection */
>          if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)
>
> That makes the assert go away. Now I'm investigating a hang in the chunk redistribution logic in rank 0, with a backtrace that looks like this:
>
> #0  0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
> #1  0x00007f4bd5d3b341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #2  0x00007f4bd5d3012d in MPID_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #3  0x00007f4bd5cbeeb4 in PMPI_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #4  0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, local_chunk_array=0x17f0f80, local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
> #5  0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00) at H5Dmpio.c:2794
> #6  0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, dx_plist=0x16f7230) at H5Dmpio.c:1447
> #7  0x00007f4bd81a027d in H5D__chunk_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at H5Dmpio.c:933
> #8  0x00007f4bd81a0968 in H5D__chunk_collective_write (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104, file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at H5Dmpio.c:1018
> #9  0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:835
> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010, direct_write=false, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:394
> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:318
>
> The other ranks have moved past this and are hanging here:
>
> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
> #1  0x00007feb6fe25341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #2  0x00007feb6fdd8975 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv (local_array=0x125f2d0, local_array_num_entries=0, array_entry_size=368, _gathered_array=0x7ffff083f1d8, _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4, allgather=true, root=0, comm=-1006632952, sort_func=0x0) at H5Dmpio.c:479
> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280, dx_plist=0x11cf240) at H5Dmpio.c:1479
> #10 0x00007feb7228a27d in H5D__chunk_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at H5Dmpio.c:933
> #11 0x00007feb7228a968 in H5D__chunk_collective_write (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74, file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at H5Dmpio.c:1018
> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:835
> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0, direct_write=false, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:394
> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:318
>
> (I'm currently running with this patch atop commit bf570b1, on an earlier theory that the crashing bug may have crept in after Jordan's big merge. I'll rebase on current develop but I doubt that'll change much.)
>
> The hang may or may not be directly related to the workaround being a bit of a hack. I can set you up with full reproduction details if you like; I seem to be getting some traction on it, but more eyeballs are always good, especially if they're better set up for MPI tracing than I am right now.
>
> On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <mille...@llnl.gov> wrote:
>> Hi Michael,
>>
>> I have not tried this in parallel yet. That said, what scale are you trying to do this at? 1000 ranks or 1,000,000 ranks? Something in between?
>>
>> My understanding is that there are some known scaling issues out past maybe 10,000 ranks. Not heard of outright assertion failures there though.
>>
>> Mark
>>
>> "Hdf-forum on behalf of Michael K. Edwards" wrote:
>>
>> I'm trying to write an HDF5 file with dataset compression from an MPI job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.) After running into the "Parallel I/O does not support filters yet" error message in release versions of HDF5, I have turned to the develop branch. Clearly there has been much work towards collective filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly it is not quite ready for prime time yet. So far I've encountered a livelock scenario with ZFP, reproduced it with SZIP, and, with no filters at all, obtained this nifty error message:
>>
>> ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion `fm->m_ndims==fm->f_ndims' failed.
>>
>> Has anyone on this list been able to write parallel HDF5 using a recent state of the develop branch, with or without filters configured?
>>
>> Thanks,
>> - Michael

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5