Thank you. That got me farther along. The crash is now in the H5Z-blosc filter glue, and it should be easy to fix. It's interesting that the filter is applied on a per-chunk basis, even for zero-sized chunks; it's possible that something is wrong higher up the stack. I haven't really thought about collective reads with filters yet. Jordan, can you fill me in on how that's supposed to work, especially if the reader has a different number of MPI ranks than the writer had?
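For the record, the fix I have in mind in the glue is simply to stop treating malloc(0) == NULL as a real allocation failure. Below is a stripped-down sketch of the shape of it, written against the standard H5Z_func_t callback signature; it's an illustrative pass-through filter with made-up names, not the actual blosc_filter.c code.

    /*
     * Illustrative only: a pass-through filter with the zero-size guard,
     * using the standard H5Z_func_t callback signature.  Not the actual
     * blosc_filter.c code; the real compress/decompress logic is elided.
     */
    #include <stdlib.h>
    #include <string.h>
    #include "hdf5.h"

    static size_t
    passthrough_filter(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                       size_t nbytes, size_t *buf_size, void **buf)
    {
        size_t outbuf_size = nbytes;   /* real glue: size comes from the blosc header */
        void  *outbuf;

        (void)flags; (void)cd_nelmts; (void)cd_values;

        outbuf = malloc(outbuf_size);

        /* malloc(0) may legally return NULL, so NULL is only an allocation
         * failure when we actually asked for bytes.  Reporting it as one for
         * a zero-sized chunk is what produces "Can't allocate decompression
         * buffer" in the trace below. */
        if (outbuf == NULL && outbuf_size != 0)
            return 0;                  /* returning 0 signals filter failure */

        if (outbuf != NULL) {
            memcpy(outbuf, *buf, nbytes);
            free(*buf);
            *buf      = outbuf;
            *buf_size = outbuf_size;
        }

        /* A genuinely empty chunk still returns 0 here, which H5Z_pipeline()
         * also treats as failure -- that's the "higher up the stack" question. */
        return nbytes;
    }

Even with that guard, H5Z_pipeline() still interprets a zero-byte return as failure as far as I can tell, which is why I suspect zero-sized chunks may need to be skipped before the filter is ever invoked. Here's the full stack from the failing write: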
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #008: /home/centos/blosc/hdf5-blosc/src/blosc_filter.c line 250 in blosc_filter(): Can't allocate decompression buffer
    major: Data filters
    minor: Callback failed

On Thu, Nov 9, 2017 at 9:22 AM, Dana Robinson <derob...@hdfgroup.org> wrote:
> In develop, H5MM_malloc() and H5MM_calloc() will throw an assert if size is
> zero. That should not be there and the function docs even say that we return
> NULL on size zero.
>
> The bad line is at lines 271 and 360 in H5MM.c if you want to try yanking
> that out and rebuilding.
>
> Dana
>
> On 11/9/17, 09:06, "Hdf-forum on behalf of Michael K. Edwards"
> <hdf-forum-boun...@lists.hdfgroup.org on behalf of m.k.edwa...@gmail.com>
> wrote:
>
> Actually, it's not the H5Screate() that crashes; that works fine since
> HDF5 1.8.7. It's a zero-sized malloc somewhere inside the call to
> H5Dwrite(), possibly in the filter. I think this is close to
> resolution; just have to get tools on it.
>
> On Thu, Nov 9, 2017 at 8:47 AM, Michael K. Edwards
> <m.k.edwa...@gmail.com> wrote:
> > Apparently this has been reported before as a problem with PETSc/HDF5
> > integration:
> > https://lists.mcs.anl.gov/pipermail/petsc-users/2012-January/011980.html
> >
> > On Thu, Nov 9, 2017 at 8:37 AM, Michael K. Edwards
> > <m.k.edwa...@gmail.com> wrote:
> >> Thank you for the validation, and for the suggestion to use
> >> H5Sselect_none(). That is probably the right thing for the dataspace.
> >> Not quite sure what to do about the memspace, though; the comment is
> >> correct that we crash if any of the dimensions is zero.
> >>
> >> On Thu, Nov 9, 2017 at 8:34 AM, Jordan Henderson
> >> <jhender...@hdfgroup.org> wrote:
> >>> It seems you're discovering the issues right as I'm typing this!
> >>>
> >>> I'm glad you were able to solve the issue with the hanging. I was
> >>> starting to suspect an issue with the MPI implementation, but it's
> >>> usually the last thing on the list after inspecting the code itself.
> >>>
> >>> As you've seen, it seems that PETSc is creating a NULL dataspace for
> >>> the ranks which are not contributing, instead of creating a
> >>> Scalar/Simple dataspace on all ranks and calling H5Sselect_none() for
> >>> those that don't participate. This would most likely explain the
> >>> assertion failure you saw in the non-filtered case, as the legacy
> >>> code probably was not expecting to receive a NULL dataspace.
> >>> On top of that, the NULL dataspace seems like it is causing the
> >>> parallel operation to break collective mode, which is not allowed
> >>> when filters are involved. I would need to do some research as to why
> >>> this happens before deciding whether it's more appropriate to modify
> >>> this in HDF5 or to have PETSc not use NULL dataspaces.
> >>>
> >>> Avoiding deadlock from the final sort has been an issue I had to
> >>> re-tackle a few different times due to the nature of the code's
> >>> complexity, but I will investigate using the chunk offset as a
> >>> secondary sort key and see if it will run into problems in any other
> >>> cases. Ideally, the chunk redistribution might be updated in the
> >>> future to involve all ranks in the operation instead of just rank 0,
> >>> also allowing for improvements to the redistribution algorithm that
> >>> may solve these problems, but for the time being this may be
> >>> sufficient.
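Coming back to the dataspace/memspace question above: the all-ranks-collective pattern Jordan describes would look roughly like this for a 1-D dataset. This is only a sketch with made-up names (write_local_part, local_count, and so on), not a proposed PETSc patch; the point is that every rank keeps making the collective H5Dwrite() call, with an empty selection on the filespace and a zero-extent memspace rather than a NULL dataspace.

    /*
     * Sketch of the "all ranks stay collective" selection pattern for a
     * 1-D dataset.  dset is an open chunked/filtered dataset and
     * local_count may be zero on some ranks.  Names are illustrative.
     */
    #include "hdf5.h"

    static herr_t
    write_local_part(hid_t dset, hsize_t local_offset, hsize_t local_count,
                     const double *local_buf)
    {
        hid_t  filespace = H5Dget_space(dset);
        /* A zero extent is legal here since 1.8.7, so the memspace itself
         * is not the problem. */
        hid_t  memspace  = H5Screate_simple(1, &local_count, NULL);
        hid_t  dxpl      = H5Pcreate(H5P_DATASET_XFER);
        herr_t status;

        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        if (local_count > 0) {
            H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                                &local_offset, NULL, &local_count, NULL);
        } else {
            /* Non-contributing ranks still make the collective H5Dwrite()
             * call, but with an empty selection rather than a NULL
             * dataspace. */
            H5Sselect_none(filespace);
            H5Sselect_none(memspace);
        }

        status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                          dxpl, local_buf);

        H5Pclose(dxpl);
        H5Sclose(memspace);
        H5Sclose(filespace);
        return status;
    }

If that shape holds up against the filtered collective path, the remaining question is whether HDF5 should also tolerate NULL dataspaces there, per Jordan's note above.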