Hi,

There have been a number of discussions over the years about on-disk
pre-allocation strategies, though no volunteers have implemented anything.
Developing an HDF5 integration library with pre-allocation and
buffer-management utilities seems like a reasonable growth area for the
project. The functionality provided by HDF5 and Apache Arrow (and whether
they do the same things -- they don't) has been a common point of confusion
for onlookers, so clarifying that one can work together with the other
might be helpful.
Both in C++ and Python we have methods for assembling arrays and record
batches from mutable buffers, so if you allocate the buffers and populate
them, you can assemble a record batch or table from them in a
straightforward manner.

- Wes

On Tue, Nov 26, 2019 at 10:25 AM Francois Saint-Jacques
<fsaintjacq...@gmail.com> wrote:
>
> Hello Maarten,
>
> In theory, you could provide a custom mmap-allocator and use the
> builder facility. Since the array is still in the "build" phase and not
> sealed, it should be fine if mremap changes the pointer address. This
> might fail in practice since the allocator is also used for auxiliary
> data, e.g. dictionary hash table data in the case of the Dictionary type.
>
> Another solution is to create a `FixedBuilder` class where:
> - the number of elements is known,
> - the data type is of fixed width, and
> - nullability is known (whether you need an extra buffer).
>
> I think sooner or later we'll need such a class.
>
> François
>
> On Tue, Nov 26, 2019 at 10:01 AM Maarten Breddels
> <maartenbredd...@gmail.com> wrote:
> >
> > In vaex I always write the data to hdf5 as 1 large chunk (per column).
> > The reason is that it allows the mmapped columns to be exposed as a
> > single numpy array (talking numerical data only for now), which many
> > people are quite comfortable with.
> >
> > The strategy for vaex to write unchunked data is to first create an
> > 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
> > write to them in chunks.
> >
> > This means that in vaex I need to support mutable data (only used
> > internally; vaex's default is immutable data, like arrow), since I
> > need to write to the memory-mapped data. It also makes the exporting
> > code relatively simple.
> >
> > I could not find a way in Arrow to get something similar done, at
> > least not without having a single pa.array instance for each column.
> > I think Arrow's mindset is that you should just use chunks, right?
> > Or is this also something that can be considered for Arrow?
> >
> > An alternative would be to implement Arrow in hdf5, which I basically
> > do now in vaex (with limited support). Again, I'm wondering whether
> > there is an interest in storing arrow data in hdf5 from the Arrow
> > community?
> >
> > cheers,
> >
> > Maarten
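The pre-allocate-then-fill strategy described above can be sketched with a plain numpy memmap (illustrative only, assuming numpy is installed; vaex does the equivalent inside an HDF5 container rather than a raw binary file):

```python
import os
import tempfile
import numpy as np

# Pre-allocate a zero-filled file on disk and memory-map it as one
# large column, then fill it chunk by chunk -- the same strategy vaex
# uses for its hdf5-backed columns (here with a raw binary file).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
n, chunk = 1_000_000, 100_000

col = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
for start in range(0, n, chunk):
    col[start:start + chunk] = np.arange(start, start + chunk)
col.flush()

# The whole column is now addressable as a single array.
print(col[0], col[-1])  # 0.0 999999.0
```

Readers can then memory-map the file again (mode="r") and see the column as one contiguous, immutable array, which is the user-facing view vaex exposes.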