On Tue, Sep 13, 2016 at 11:27 AM, sparrowhawker <[email protected]> wrote:
> Hi,
>
> I'm new to Julia, and have been able to accomplish a lot of what I used to
> do in Matlab/Fortran, in very little time since I started using Julia in
> the last three months. Here's my newest stumbling block.
>
> I have a process which creates nsamples within a loop. Each sample takes a
> long time to compute, say 1 to 10 seconds, as there are expensive finite
> difference operations which ultimately lead to a sample. I have to store
> each of the nsamples, and I know the size and dimensions of each of the
> nsamples (all samples have the same size and dimensions). However,
> depending on the run time parameters, each sample may be a 32x32 image or
> perhaps a 64x64x64 voxset with 3 attributes, i.e., a 64x64x64x3
> hyper-rectangle. To be clear, each sample can be a hyper-rectangle of
> arbitrary dimension, specified at run time.
>
> Obviously, since I don't want to lose computation and want to see
> incremental progress, I'd like to do incremental saves of these samples on
> disk, instead of waiting to collect all nsamples at the end. For instance,
> if I had to store 1000 samples of size 64x64, I thought perhaps I could
> chunk and save 64x64 slices to an HDF5 file 1000 times. Is this the right
> approach? If so, here's a prototype program to do so, but it depends on my
> knowing the number of dimensions of the slice, which is not known until
> runtime:
>
> using HDF5
>
> filename = "test.h5"
> # open file
> fmode = "w"
> # get a file object
> fid = h5open(filename, fmode)
> # matrix to write in chunks
> B = rand(64, 64, 1000)
> # figure out its dimensions
> sizeTuple = size(B)
> Ndims = length(sizeTuple)
> # set up to write in chunks of sizeArray
> sizeArray = ones(Int, Ndims)
> [sizeArray[i] = sizeTuple[i] for i in 1:(Ndims-1)]  # last value of sizeArray stays 1
> # create a dataset "models" within root
> dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)),
>                 "chunk", sizeArray)
> [dset[:,:,i] = slicedim(B, Ndims, i) for i in 1:size(B, Ndims)]
> close(fid)
>
> This works, but the second last line, dset[:,:,i], requires syntax
> specific to writing a slice of a dimension 3 array - but I don't know the
> dimensions until run time. Of course I could just write to a flat binary
> file incrementally, but HDF5.jl could make my life so much simpler!

HDF5 supports "extensible datasets", which were created for use cases such as this one. I don't recall the exact syntax, but if I recall correctly, you can specify one dimension (the first one in C, the last one in Julia) to be extensible, and then you can add more data as you go (there is a rough sketch below). You will probably need to specify a chunk size, which could be the size of the increment in your case. Given file system speeds, a chunk size smaller than a few megabytes probably doesn't make much sense (i.e. it will slow things down).

If you want to monitor the HDF5 file as it is being written, look at the SWMR (single-writer/multiple-reader) feature. This requires HDF5 1.10; unfortunately, Julia will by default often still install version 1.8.

If you want to protect against crashes of your code so that you don't lose progress, then HDF5 is probably not right for you. Once an HDF5 file is open for writing, the on-disk state might be inconsistent, so you can lose all data when your code crashes. In this case, you might want to write the data into different files, one per increment. HDF5 1.10 offers "virtual datasets" (views), which are umbrella files that stitch together datasets stored in other files.
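Here is a rough, untested sketch of the extensible-dataset approach mentioned above, using what I believe is HDF5.jl's d_create/set_dims! syntax (please check the package README for the exact form; the file, dataset, and variable names are just placeholders):

using HDF5

nx, ny = 64, 64            # size of one sample, known only at run time
nsamples = 1000

fid = h5open("test.h5", "w")
# Last dimension starts at 0 and may grow without bound (-1 = unlimited).
# The chunk here is one full sample; at 64x64 Float64 that is only 32 kB,
# so in practice you may want to chunk several samples together.
dset = d_create(fid, "models", Float64,
                ((nx, ny, 0), (nx, ny, -1)),
                "chunk", (nx, ny, 1))

for i in 1:nsamples
    sample = rand(nx, ny)          # stand-in for the expensive computation
    set_dims!(dset, (nx, ny, i))   # grow the dataset by one slice
    dset[:, :, i] = sample         # write the new slice
end
close(fid)

The dset[:, :, i] indexing still hard-codes the rank, of course; to keep it generic you could try splatting colons, e.g. dset[fill(Colon(), ndims(sample))..., i] = sample, but I have not checked whether HDF5.jl's setindex! accepts that.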
If you are looking for generic advice for setting things up with HDF5, then I recommend their documentation. If you are looking for how to access these features in Julia, or if you notice a feature that is not available in Julia, then we'll be happy to explain or correct things.

What do you mean by "dimension only known at run time" -- do you mean what Julia calls "size" (shape), or what Julia calls "ndims" (rank)? Do all datasets have the same size, or do they differ? If they differ, then putting them into the same dataset might not make sense; in this case, I would write them into different datasets (see the sketch in the PS below).

-erik

--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
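PS: If you do go the one-dataset-per-sample route, a minimal (untested) sketch could look like the following; compute_sample is just a stand-in for your finite-difference step, and the flush call is my attempt to push progress to disk after every sample:

using HDF5

nsamples = 1000
compute_sample() = rand(64, 64)   # stand-in for the real expensive computation

h5open("samples.h5", "w") do fid
    for i in 1:nsamples
        sample = compute_sample()
        write(fid, "sample_$i", sample)   # one dataset per sample; any rank/size works
        flush(fid)                        # flush buffers; still no hard crash-consistency guarantee
    end
end

Since each sample is written as a whole array, you never need rank-specific slicing syntax.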
