Hi,

There have been a number of discussions over the years about on-disk
pre-allocation strategies, though no volunteers have implemented anything.
Developing an HDF5 integration library with pre-allocation and
buffer-management utilities seems like a reasonable growth area for the
project. The functionality provided by HDF5 and Apache Arrow (and whether
they do the same things -- they don't) has been a common point of confusion
for onlookers, so clarifying that one can work together with the other
might be helpful.
Both in C++ and Python we have methods for assembling arrays and record
batches from mutable buffers, so if you allocate the buffers and populate
them, you can assemble a record batch or table from them in a
straightforward manner.

- Wes

On Tue, Nov 26, 2019 at 10:25 AM Francois Saint-Jacques
<fsaintjacq...@gmail.com> wrote:
>
> Hello Maarten,
>
> In theory, you could provide a custom mmap-allocator and use the
> builder facility. Since the array is still in the "build" phase and not
> sealed, it should be fine if mremap changes the pointer address. This
> might fail in practice since the allocator is also used for auxiliary
> data, e.g. dictionary hash table data in the case of the Dictionary type.
>
> Another solution is to create a `FixedBuilder` class where:
> - the number of elements is known,
> - the data type is of fixed width, and
> - nullability is known (whether you need an extra buffer).
>
> I think sooner or later we'll need such a class.
>
> François
>
> On Tue, Nov 26, 2019 at 10:01 AM Maarten Breddels
> <maartenbredd...@gmail.com> wrote:
> >
> > In vaex I always write the data to hdf5 as 1 large chunk (per column).
> > The reason is that it allows the mmapped columns to be exposed as a
> > single numpy array (talking numerical data only for now), which many
> > people are quite comfortable with.
> >
> > The strategy for vaex to write unchunked data is to first create an
> > 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
> > write to them in chunks.
> >
> > This means that in vaex I need to support mutable data (only used
> > internally; vaex's default is immutable data, like arrow), since I
> > need to write to the memory-mapped data. It also makes the exporting
> > code relatively simple.
> >
> > I could not find a way in Arrow to get something similar done, at
> > least not without having a single pa.array instance for each column.
> > I think Arrow's mindset is that you should just use chunks, right?
> > Or is this also something that can be considered for Arrow?
> >
> > An alternative would be to implement Arrow in hdf5, which I basically
> > do now in vaex (with limited support). Again, I'm wondering whether
> > there is an interest in storing arrow data in hdf5 from the Arrow
> > community?
> >
> > cheers,
> >
> > Maarten
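The pre-allocate-then-fill strategy described above can be sketched with a plain numpy memmap (illustrative only, assuming numpy is installed; vaex does the equivalent inside an HDF5 container rather than a raw binary file):

```python
import os
import tempfile
import numpy as np

# Pre-allocate a zero-filled file on disk and memory-map it as one
# large column, then fill it chunk by chunk -- the same strategy vaex
# uses for its hdf5-backed columns (here with a raw binary file).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
n, chunk = 1_000_000, 100_000

col = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
for start in range(0, n, chunk):
    col[start:start + chunk] = np.arange(start, start + chunk)
col.flush()

# The whole column is now addressable as a single array.
print(col[0], col[-1])  # 0.0 999999.0
```

Readers can then memory-map the file again (mode="r") and see the column as one contiguous, immutable array, which is the user-facing view vaex exposes.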