In develop, H5MM_malloc() and H5MM_calloc() will fail an assert if size is 
zero. That assert should not be there; the function docs even say that we 
return NULL when size is zero.

The bad asserts are at lines 271 and 360 in H5MM.c if you want to try yanking 
them out and rebuilding.
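
For illustration, a minimal sketch of the documented contract (not the actual 
H5MM.c code; my_malloc is a hypothetical stand-in): a zero-byte request should 
come back as NULL rather than tripping an assert.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical wrapper showing the documented behavior: return NULL
     * for a zero-byte request, otherwise defer to malloc(). */
    static void *
    my_malloc(size_t size)
    {
        return (0 == size) ? NULL : malloc(size);
    }

    /* The develop branch instead does the equivalent of assert(size > 0)
     * before allocating, which aborts the process on a zero-byte request. */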

Dana

On 11/9/17, 09:06, "Hdf-forum on behalf of Michael K. Edwards" 
<hdf-forum-boun...@lists.hdfgroup.org on behalf of m.k.edwa...@gmail.com> wrote:

    Actually, it's not the H5Screate() that crashes; that has worked fine
    since HDF5 1.8.7.  It's a zero-sized malloc somewhere inside the call to
    H5Dwrite(), possibly in the filter.  I think this is close to
    resolution; just have to get tools on it.
    
    On Thu, Nov 9, 2017 at 8:47 AM, Michael K. Edwards
    <m.k.edwa...@gmail.com> wrote:
    > Apparently this has been reported before as a problem with PETSc/HDF5
    > integration:
    > https://lists.mcs.anl.gov/pipermail/petsc-users/2012-January/011980.html
    >
    > On Thu, Nov 9, 2017 at 8:37 AM, Michael K. Edwards
    > <m.k.edwa...@gmail.com> wrote:
    >> Thank you for the validation, and for the suggestion to use
    >> H5Sselect_none().  That is probably the right thing for the dataspace.
    >> Not quite sure what to do about the memspace, though; the comment is
    >> correct that we crash if any of the dimensions is zero.
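
A minimal sketch of that pattern, assuming a 1-D dataset handle dset that has
already been opened collectively; local_count, offset, and buf are hypothetical
per-rank values, and error checking is omitted. Ranks with nothing to write
still take part in the collective H5Dwrite() call, but select nothing in both
spaces; giving the memspace a nonzero extent sidesteps the zero-sized-dimension
crash.

    hsize_t one       = 1;
    hid_t   filespace = H5Dget_space(dset);
    hid_t   memspace;
    hid_t   dxpl      = H5Pcreate(H5P_DATASET_XFER);

    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    if (local_count > 0) {
        memspace = H5Screate_simple(1, &local_count, NULL);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset,
                            NULL, &local_count, NULL);
    }
    else {
        /* Non-contributing rank: make the same collective call, but
         * select no elements in either space. */
        memspace = H5Screate_simple(1, &one, NULL);
        H5Sselect_none(memspace);
        H5Sselect_none(filespace);
    }

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Pclose(dxpl);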
    >>
    >> On Thu, Nov 9, 2017 at 8:34 AM, Jordan Henderson
    >> <jhender...@hdfgroup.org> wrote:
    >>> It seems you're discovering the issues right as I'm typing this!
    >>>
    >>>
    >>> I'm glad you were able to solve the issue with the hanging. I was
    >>> starting to suspect an issue with the MPI implementation but it's
    >>> usually the last thing on the list after inspecting the code itself.
    >>>
    >>>
    >>> As you've seen, it seems that PETSc is creating a NULL dataspace for
    >>> the ranks that are not contributing, instead of creating a Scalar/Simple
    >>> dataspace on all ranks and calling H5Sselect_none() for those that don't
    >>> participate. That would most likely explain the assertion failure you
    >>> saw in the non-filtered case, as the legacy code was probably not
    >>> expecting to receive a NULL dataspace. On top of that, the NULL
    >>> dataspace seems to be causing the parallel operation to break collective
    >>> mode, which is not allowed when filters are involved. I would need to do
    >>> some research into why this happens before deciding whether it's more
    >>> appropriate to modify this in HDF5 or to have PETSc not use NULL
    >>> dataspaces.
    >>>
    >>>
    >>> Avoiding deadlock from the final sort is an issue I have had to
    >>> re-tackle a few times because of the complexity of the code, but I will
    >>> investigate using the chunk offset as a secondary sort key and see
    >>> whether it runs into problems in any other cases. Ideally, the chunk
    >>> redistribution might be updated in the future to involve all ranks in
    >>> the operation instead of just rank 0, which would also allow
    >>> improvements to the redistribution algorithm that may solve these
    >>> problems, but for the time being this may be sufficient.
    

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
