Hi John,

> Any thoughts on creating large IPC format record batch(es) in-place in a
> single pre-allocated buffer, that could be used with mmap?
This seems doable "by hand" today; it seems like this would be valuable to
contribute.

> The idea was to allow record batch lengths to be smaller than the
> associated buffer lengths, which seemed like an easy change at the time...
> although I'll grant that we only use trivial arrow types and in more
> complex cases there may be side-effects I can't envision.

To my recollection, this was about "Array Lengths", not "Buffer Lengths". I
still think having Array Lengths disagree with RecordBatch Lengths is not a
good idea. I think "Buffer Length" and "Buffer Offset" can be arbitrary as
long as they stay within the Body Length on the Message. It really depends
on how one interprets this part of the specification:

- The body, a flat sequence of memory buffers written end-to-end with
  appropriate padding to ensure a minimum of 8-byte alignment

So one could imagine the following algorithm (a minimal sketch of the
arithmetic appears further below):

1. pointer a = reserve space for the RecordBatch Message metadata.
2. pointer b = reserve space for the data buffers directly after the space
   reserved for "a".
3. Populate the data (allocate buffers from "b" into standard Arrow data
   structures, honoring the 8-byte alignment requirement).
4. Write the metadata to "a" (body length = "maximum end address of data
   buffer" - "b"; buffer offsets are "start of buffer memory address" - "b").
5. Add an entry to the file index.

This might not play nicely with other aspects of Arrow IO (e.g. prefetching),
but I think it should still be a valid file. I'd guess others have opinions
on this as well.

Thanks,
Micah

On Wed, Oct 20, 2021 at 5:26 PM John Muehlhausen <j...@jgm.org> wrote:

> Motivation:
>
> We have memory-mappable Arrow IPC files with N batches where column(s) are
> sorted to support binary search. Because log2(n) < log2(n/2) + log2(n/2)
> and binary search is required on each batch, we prefer the batches to be
> as large as possible to reduce total search time... perhaps larger than
> available RAM. On the read side, only the pages needed for the search
> bisections and the subsequent slice traversal are mapped in, of course.
>
> The question then becomes one of creating large IPC-format files where the
> individual batches do not exist first in RAM, because of their size.
>
> Conceptually, this would seem to entail:
> * allocating a fixed mmap'd area to write into
> * using builders to create buffers at the locations they would end up at
>   in the IPC format, and freezing these as arrays (if I understand the
>   terminology correctly)
> * plopping in various other things such as metadata, schema, etc.
>
> One difficulty is that we want to size this area without having first
> analyzed the data to be written to it, since such an analysis consumes
> compute resources. Therefore the area set aside for (e.g.) a variable
> length string column would be a guess based on statistics, and we would
> want to just write the column buffers until the first one is full, which
> may leave others (or itself) partially unpopulated.
>
> This could result in some "wasted space" in the file, which is a tradeoff
> we can live with for the above reasons, and it brings me back to
> https://issues.apache.org/jira/browse/ARROW-5916 where this was discussed
> before (and another discussion is linked there). The idea was to allow
> record batch lengths to be smaller than the associated buffer lengths,
> which seemed like an easy change at the time... although I'll grant that
> we only use trivial arrow types and in more complex cases there may be
> side-effects I can't envision.
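As a hedged illustration of the pointer arithmetic in the numbered steps
above, here is a minimal C++ sketch for a single int64 column with a validity
bitmap. The names (`BufferSlot`, `AlignTo8`, `metadata_reservation`) are
illustrative, not Arrow APIs, and a real writer would still have to serialize
the flatbuffers RecordBatch message at "a" and the Footer entry of step 5.

```cpp
#include <cstdint>
#include <vector>

constexpr int64_t kAlign = 8;
int64_t AlignTo8(int64_t n) { return (n + kAlign - 1) & ~(kAlign - 1); }

// What ends up in the metadata's Buffer entries: offsets relative to "b".
struct BufferSlot {
  int64_t offset;
  int64_t length;
};

int main() {
  std::vector<uint8_t> file(1 << 20);      // stand-in for the mmap'd region
  uint8_t* a = file.data();                // 1. reserved for the Message metadata
  int64_t metadata_reservation = 1024;     //    (flatbuffer + continuation + padding)
  uint8_t* b = a + metadata_reservation;   // 2. body starts directly after "a"

  // 3. Carve the column's buffers out of the body region, 8-byte aligned.
  //    Builders would then populate these regions in place.
  int64_t num_rows = 100000;
  uint8_t* validity = b;
  int64_t validity_len = AlignTo8((num_rows + 7) / 8);
  uint8_t* values = validity + validity_len;
  int64_t values_len = AlignTo8(num_rows * int64_t(sizeof(int64_t)));

  // 4. Metadata fields are all expressed relative to "b".
  BufferSlot slots[2] = {
      {static_cast<int64_t>(validity - b), validity_len},
      {static_cast<int64_t>(values - b), values_len},
  };
  int64_t body_length = (values + values_len) - b;  // max end address minus "b"

  // 5. A Footer/file-index entry would then record where the metadata block
  //    at "a" starts and that the body has length body_length.
  (void)slots;
  (void)body_length;
  return 0;
}
```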
> One of the ideas was to go ahead and fill in the buffers to create a valid
> record batch but then store the sliced-down size in (e.g.) the user-defined
> metadata, but this forces anyone using the IPC file to use a non-standard
> mechanism to reject the "data" that fills the unpopulated buffer sections.
>
> Even with the ability for a batch to be smaller than its buffers (to help
> readers reject the residual of the buffers without referring to custom
> metadata), I think I'm left needing to write low-level code outside of the
> Arrow library to create such a file, since I cannot first create the batch
> in RAM and then copy it out, due to the size and also due to wanting to
> avoid the copy operation.
>
> Any thoughts on creating large IPC format record batch(es) in-place in a
> single pre-allocated buffer, that could be used with mmap?
>
> Here is someone with a similar concern:
> https://www.mail-archive.com/user@arrow.apache.org/msg01187.html
>
> It seems like the
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> example could be tweaked to use "pools" that define exactly where to put
> each buffer, but then the final `arrow::Table::Make` (or the equivalent for
> batches/IPC) must also receive instruction about where exactly to write the
> user metadata, schema, footer, etc.
>
> Thanks for any ideas,
> John
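Picking up the "pools that define exactly where to put each buffer" idea:
once memory has been placed at the desired location (e.g. inside the
pre-sized mmap'd region), the zero-copy wrapping half already works with
stock Arrow C++ by constructing `ArrayData` over caller-owned buffers. The
sketch below is only a sketch: `WrapPreplacedColumn` is a hypothetical
helper, the `std::vector` stands in for the mapped region, and the open
question from the thread (writing the RecordBatch metadata, schema, and
footer at the right offsets without first materializing the batch elsewhere)
remains untouched.

```cpp
#include <arrow/api.h>

#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: wrap a caller-placed int64 values region (no nulls)
// into a one-column RecordBatch without copying it.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> WrapPreplacedColumn(
    uint8_t* values_addr, int64_t num_rows) {
  auto values_buf = std::make_shared<arrow::MutableBuffer>(
      values_addr, num_rows * static_cast<int64_t>(sizeof(int64_t)));
  // Buffer slot 0 (the validity bitmap) is null because the column has no nulls.
  auto data = arrow::ArrayData::Make(arrow::int64(), num_rows,
                                     {nullptr, values_buf}, /*null_count=*/0);
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  return arrow::RecordBatch::Make(schema, num_rows, {arrow::MakeArray(data)});
}

int main() {
  // Stand-in for a pre-allocated, mmap'd region at a known offset in the file.
  std::vector<uint8_t> region(1 << 20);
  const int64_t num_rows = 1000;
  auto* values = reinterpret_cast<int64_t*>(region.data());
  for (int64_t i = 0; i < num_rows; ++i) values[i] = i;  // populate in place

  auto batch = WrapPreplacedColumn(region.data(), num_rows).ValueOrDie();
  // The batch's buffers alias the caller-chosen memory; nothing was copied.
  return batch->ValidateFull().ok() ? 0 : 1;
}
```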