Re: [julia-users] Advice on (perhaps) chunking to HDF5

Anandaroop Ray Tue, 13 Sep 2016 18:37:32 -0700

Cool! The colons approach makes sense to me, followed by splatting.

I'm unfamiliar with the syntax here but when I try to create a tuple in the
REPL using


inds = ((:) for i in 1:3)

I get
ERROR: syntax: missing separator in tuple
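
A possible workaround, sketched under the assumption that the error comes from a Julia version that does not yet parse generator expressions (they were added in 0.5): build the tuple of colons with `ntuple` instead. The names `nd` and `A` are illustrative only, and `A` is a plain in-memory array standing in for the HDF5 dataset.

```Julia
# Rank of one sample; 3 is only an illustrative value.
nd = 3
# A tuple of `nd` colons, built without generator syntax.
inds = ntuple(i -> Colon(), nd)

# Stand-in for the dataset: one extra trailing dimension indexes the samples.
A = zeros(4, 4, 4, 10)
# Equivalent to A[:, :, :, 2] = ones(4, 4, 4)
A[inds..., 2] = ones(4, 4, 4)
```

Whether the same splatted indexing works on an HDF5.jl dataset object is not something this sketch checks.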



On 13 September 2016 at 17:27, Erik Schnetter <[email protected]> wrote:

> If you have a varying rank, then you should probably use something like
> `CartesianIndex` and `CartesianRange` to represent the indices, or possibly
> tuples of integers. You would then use the splatting operator to create the
> indexing instructions:
>
> ```Julia
> indrange = CartesianRange(xyz)
> dset[indrange..., i] = slicedim
> ```
>
> I don't know whether the expression `indrange...` works as-is, or whether
> you have to manually create a tuple of `UnitRange`s.
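>
> A minimal sketch of the manual route, assuming `xyz` holds one sample and
> using a plain array (the hypothetical `store`) as a stand-in for `dset`:
>
> ```Julia
> xyz = rand(4, 5)                    # one sample (illustrative size)
> store = zeros(4, 5, 3)              # stand-in for the HDF5 dataset
> # Tuple of UnitRanges, one per dimension of the sample.
> ranges = map(n -> 1:n, size(xyz))
> store[ranges..., 2] = xyz
> ```
>
> Splatting `ranges` passes one range per dimension, which for full-extent
> ranges selects the same elements as the colon form.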
>
> If you want to use colons, then you'd write
>
> ```Julia
> inds = ((:) for i in 1:rank)
> dset[inds..., i] = xyz
> ```
>
> -erik
>
>
>
>
> On Tue, Sep 13, 2016 at 5:08 PM, Anandaroop Ray <[email protected]>
> wrote:
>
>> Many thanks for your comprehensive recommendations. I think HDF5 views
>> are probably what I need to go with - will read up more and then ask.
>>
>> What I mean about dimension is rank, really. The shape is always the same
>> for all samples. One slice for storage, i.e., one sample, could be chunked
>> as dset[:,:,i] or dset[:,:,:,:,i] but always of the form dset[:,...,:,i],
>> depending on input to the code at runtime.
>>
>> Thanks
>>
>> On 13 September 2016 at 14:47, Erik Schnetter <[email protected]>
>> wrote:
>>
>>> On Tue, Sep 13, 2016 at 11:27 AM, sparrowhawker <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm new to Julia, and have been able to accomplish a lot of what I used
>>>> to do in Matlab/Fortran, in very little time since I started using Julia in
>>>> the last three months. Here's my newest stumbling block.
>>>>
>>>> I have a process which creates nsamples within a loop. Each sample
>>>> takes a long time to compute, say 1 to 10 seconds, because it involves
>>>> expensive finite difference operations. I have
>>>> to store each of the nsamples, and I know the size and dimensions of each
>>>> of the nsamples (all samples have the same size and dimensions). However,
>>>> depending on the run time parameters, each sample may be a 32x32 image or
>>>> perhaps a 64x64x64 voxset with 3 attributes, i.e., a 64x64x64x3
>>>> hyper-rectangle. To be clear, each sample can be an arbitrary dimension
>>>> hyper-rectangle, specified at run time.
>>>>
>>>> Obviously, since I don't want to lose computation and want to see
>>>> incremental progress, I'd like to do incremental saves of these samples on
>>>> disk, instead of waiting to collect all nsamples at the end. For instance,
>>>> if I had to store 1000 samples of size 64x64, I thought perhaps I could
>>>> chunk and save 64x64 slices to an HDF5 file 1000 times. Is this the right
>>>> approach? If so, here's a prototype program to do so, but it depends on my
>>>> knowing the number of dimensions of the slice, which is not known until
>>>> runtime,
>>>>
>>>> using HDF5
>>>>
>>>> filename = "test.h5"
>>>> # open file
>>>> fmode ="w"
>>>> # get a file object
>>>> fid = h5open(filename, fmode)
>>>> # matrix to write in chunks
>>>> B = rand(64,64,1000)
>>>> # figure out its dimensions
>>>> sizeTuple = size(B)
>>>> Ndims = length(sizeTuple)
>>>> # set up to write in chunks of sizeArray
>>>> sizeArray = ones(Int, Ndims)
>>>> # the last entry of sizeArray stays 1, i.e. the chunk shape is :,...,:,1
>>>> [sizeArray[i] = sizeTuple[i] for i in 1:(Ndims-1)]
>>>> # create a dataset models within root
>>>> dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)),
>>>> "chunk", sizeArray)
>>>> [dset[:,:,i] = slicedim(B, Ndims, i) for i in 1:size(B, Ndims)]
>>>> close(fid)
>>>>
>>>> This works, but the second-to-last line, dset[:,:,i], requires syntax
>>>> specific to writing a slice of a dimension 3 array - but I don't know the
>>>> dimensions until run time. Of course I could just write to a flat binary
>>>> file incrementally, but HDF5.jl could make my life so much simpler!
>>>>
>>>
>>> HDF5 supports "extensible datasets", which were created for use cases
>>> such as this one. I don't recall the exact syntax, but if I recall
>>> correctly, you can specify one dimension (the first one in C, the last one
>>> in Julia) to be extensible, and then you can add more data as you go. You
>>> will probably need to specify a chunk size, which could be the size of the
>>> increment in your case. Given file system speeds, a chunk size smaller than
>>> a few megabytes probably doesn't make much sense (i.e. it will slow things down).
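>>>
>>> As a rough worked example of that sizing, assuming the 64x64 Float64
>>> samples from your message and a hypothetical target of about 4 MiB per
>>> chunk (the target is an assumption, not a measured value):
>>>
>>> ```Julia
>>> # Bytes occupied by one 64x64 Float64 sample.
>>> sample_bytes = 64 * 64 * sizeof(Float64)
>>> # Hypothetical target chunk size; tune for your file system.
>>> target_bytes = 4 * 1024^2
>>> # Number of samples to group along the last dimension of each chunk.
>>> samples_per_chunk = cld(target_bytes, sample_bytes)
>>> chunk_dims = (64, 64, samples_per_chunk)
>>> ```
>>>
>>> A single 64x64 sample is only 32 KiB, so with these numbers one chunk
>>> would hold 128 samples rather than one.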
>>>
>>> If you want to monitor the HDF5 file as it is being written, look at the
>>> SWMR (single-writer/multiple-reader) feature. This requires HDF5 1.10;
>>> unfortunately, Julia will by default often still install version 1.8.
>>>
>>> If you want to protect against crashes of your code so that you don't
>>> lose progress, then HDF5 is probably not right for you. Once an HDF5 file
>>> is open for writing, the on-disk state might be inconsistent, so that you
>>> can lose all data when your code crashes. In this case, you might want to
>>> write data into different files, one per increment. HDF5 1.10 offers
>>> "views", which are umbrella files that stitch together datasets stored in
>>> other files.
>>>
>>> If you are looking for generic advice for setting up things with HDF5,
>>> then I recommend their documentation. If you are looking for how to access
>>> these features in Julia, or if you notice a feature that is not available
>>> in Julia, then we'll be happy to explain or correct things.
>>>
>>> What do you mean by "dimension only known at run time" -- do you mean what
>>> Julia calls "size" (shape) or what Julia calls "dim" (rank)?
>>>
>>> Do all datasets have the same size, or do they differ? If they differ,
>>> then putting them into the same dataset might not make sense; in this case,
>>> I would write them into different datasets.
>>>
>>> -erik
>>>
>>> --
>>> Erik Schnetter <[email protected]>
>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>
>>
>>
>
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>
