Thank you. That got me farther along. The crash is now in the H5Z-blosc filter glue, and it should be easy to fix. It's interesting that the filter is applied on a per-chunk basis, even for zero-sized chunks; it's possible that something is wrong higher up the stack. I haven't really thought about collective reads with filters yet. Jordan, can you fill me in on how that's supposed to work, especially if the reader has a different number of MPI ranks than the writer had?
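For the record, the fix I have in mind in the glue is simply to stop treating malloc(0) == NULL as a real allocation failure. Below is a stripped-down sketch of the shape of it, written against the standard H5Z_func_t callback signature; it's an illustrative pass-through filter with made-up names, not the actual blosc_filter.c code.

    /*
     * Illustrative only: a pass-through filter with the zero-size guard,
     * using the standard H5Z_func_t callback signature.  Not the actual
     * blosc_filter.c code; the real compress/decompress logic is elided.
     */
    #include <stdlib.h>
    #include <string.h>
    #include "hdf5.h"

    static size_t
    passthrough_filter(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                       size_t nbytes, size_t *buf_size, void **buf)
    {
        size_t outbuf_size = nbytes;   /* real glue: size comes from the blosc header */
        void  *outbuf;

        (void)flags; (void)cd_nelmts; (void)cd_values;

        outbuf = malloc(outbuf_size);

        /* malloc(0) may legally return NULL, so NULL is only an allocation
         * failure when we actually asked for bytes.  Reporting it as one for
         * a zero-sized chunk is what produces "Can't allocate decompression
         * buffer" in the trace below. */
        if (outbuf == NULL && outbuf_size != 0)
            return 0;                  /* returning 0 signals filter failure */

        if (outbuf != NULL) {
            memcpy(outbuf, *buf, nbytes);
            free(*buf);
            *buf      = outbuf;
            *buf_size = outbuf_size;
        }

        /* A genuinely empty chunk still returns 0 here, which H5Z_pipeline()
         * also treats as failure -- that's the "higher up the stack" question. */
        return nbytes;
    }

Even with that guard, H5Z_pipeline() still interprets a zero-byte return as failure as far as I can tell, which is why I suspect zero-sized chunks may need to be skipped before the filter is ever invoked. Here's the full stack from the failing write: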
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #008: /home/centos/blosc/hdf5-blosc/src/blosc_filter.c line 250 in blosc_filter(): Can't allocate decompression buffer
    major: Data filters
    minor: Callback failed

On Thu, Nov 9, 2017 at 9:22 AM, Dana Robinson <derob...@hdfgroup.org> wrote:
> In develop, H5MM_malloc() and H5MM_calloc() will throw an assert if size is
> zero. That should not be there and the function docs even say that we return
> NULL on size zero.
>
> The bad line is at lines 271 and 360 in H5MM.c if you want to try yanking
> that out and rebuilding.
>
> Dana
>
> On 11/9/17, 09:06, "Hdf-forum on behalf of Michael K. Edwards"
> <hdf-forum-boun...@lists.hdfgroup.org on behalf of m.k.edwa...@gmail.com>
> wrote:
>
> Actually, it's not the H5Screate() that crashes; that works fine since
> HDF5 1.8.7. It's a zero-sized malloc somewhere inside the call to
> H5Dwrite(), possibly in the filter. I think this is close to
> resolution; just have to get tools on it.
>
> On Thu, Nov 9, 2017 at 8:47 AM, Michael K. Edwards
> <m.k.edwa...@gmail.com> wrote:
> > Apparently this has been reported before as a problem with PETSc/HDF5
> > integration:
> > https://lists.mcs.anl.gov/pipermail/petsc-users/2012-January/011980.html
> >
> > On Thu, Nov 9, 2017 at 8:37 AM, Michael K. Edwards
> > <m.k.edwa...@gmail.com> wrote:
> >> Thank you for the validation, and for the suggestion to use
> >> H5Sselect_none(). That is probably the right thing for the dataspace.
> >> Not quite sure what to do about the memspace, though; the comment is
> >> correct that we crash if any of the dimensions is zero.
> >>
> >> On Thu, Nov 9, 2017 at 8:34 AM, Jordan Henderson
> >> <jhender...@hdfgroup.org> wrote:
> >>> It seems you're discovering the issues right as I'm typing this!
> >>>
> >>> I'm glad you were able to solve the issue with the hanging. I was
> >>> starting to suspect an issue with the MPI implementation, but it's
> >>> usually the last thing on the list after inspecting the code itself.
> >>>
> >>> As you've seen, it seems that PETSc is creating a NULL dataspace for
> >>> the ranks which are not contributing, instead of creating a
> >>> Scalar/Simple dataspace on all ranks and calling H5Sselect_none() for
> >>> those that don't participate. This would most likely explain the
> >>> assertion failure you saw in the non-filtered case, as the legacy
> >>> code probably was not expecting to receive a NULL dataspace.
> >>> On top of that, the NULL dataspace seems like it is causing the
> >>> parallel operation to break collective mode, which is not allowed
> >>> when filters are involved. I would need to do some research as to why
> >>> this happens before deciding whether it's more appropriate to modify
> >>> this in HDF5 or to have PETSc not use NULL dataspaces.
> >>>
> >>> Avoiding deadlock from the final sort has been an issue I had to
> >>> re-tackle a few different times due to the nature of the code's
> >>> complexity, but I will investigate using the chunk offset as a
> >>> secondary sort key and see if it will run into problems in any other
> >>> cases. Ideally, the chunk redistribution might be updated in the
> >>> future to involve all ranks in the operation instead of just rank 0,
> >>> also allowing for improvements to the redistribution algorithm that
> >>> may solve these problems, but for the time being this may be
> >>> sufficient.
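Coming back to the dataspace/memspace question above: the all-ranks-collective pattern Jordan describes would look roughly like this for a 1-D dataset. This is only a sketch with made-up names (write_local_part, local_count, and so on), not a proposed PETSc patch; the point is that every rank keeps making the collective H5Dwrite() call, with an empty selection on the filespace and a zero-extent memspace rather than a NULL dataspace.

    /*
     * Sketch of the "all ranks stay collective" selection pattern for a
     * 1-D dataset.  dset is an open chunked/filtered dataset and
     * local_count may be zero on some ranks.  Names are illustrative.
     */
    #include "hdf5.h"

    static herr_t
    write_local_part(hid_t dset, hsize_t local_offset, hsize_t local_count,
                     const double *local_buf)
    {
        hid_t  filespace = H5Dget_space(dset);
        /* A zero extent is legal here since 1.8.7, so the memspace itself
         * is not the problem. */
        hid_t  memspace  = H5Screate_simple(1, &local_count, NULL);
        hid_t  dxpl      = H5Pcreate(H5P_DATASET_XFER);
        herr_t status;

        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        if (local_count > 0) {
            H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                                &local_offset, NULL, &local_count, NULL);
        } else {
            /* Non-contributing ranks still make the collective H5Dwrite()
             * call, but with an empty selection rather than a NULL
             * dataspace. */
            H5Sselect_none(filespace);
            H5Sselect_none(memspace);
        }

        status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                          dxpl, local_buf);

        H5Pclose(dxpl);
        H5Sclose(memspace);
        H5Sclose(filespace);
        return status;
    }

If that shape holds up against the filtered collective path, the remaining question is whether HDF5 should also tolerate NULL dataspaces there, per Jordan's note above.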