+1 from me, we would definitely like this feature. A name like recordBatchNumRows or recordBatchRowCounts makes it clear that the field counts rows, not bytes. The RecordBatchStatistics idea would also be fine for us, although we don't have an immediate need for other statistics.
Our use-case is that we need to retrieve ranges from large datasets for both processing and display (i.e. pagination). For large-ish data stored in cloud buckets or HDFS, reading the metadata for each batch isn't really a performant option. Our current solution, for datasets created / owned by our platform, is to use a constant batch size and store that size in the footer custom metadata. Having this properly in the Arrow format would obviously be much better, and would allow for variable-size batches, e.g. if data arrives as a series of deltas and we don't want to re-batch it.

Just adding this information to the file format and adding the write implementation to the various language implementations is enough for us. A second step would be to add a read-range operation to the language APIs. I suspect this would make the feature much more usable for a lot of people, but it's not essential for our particular project, since we already intercept the loading mechanism to get non-blocking behaviour. From a testability point of view it might still make sense to do both bits together though!

On 2023/03/19 04:39:15 Steve Kim wrote:
> Hello everyone,
>
> I would like to be able to quickly seek to an arbitrary row in an Arrow
> file.
>
> With the current file format, reading the file footer alone is not enough
> to determine the record batch that contains a given row index. The row
> counts of the record batches are only found in the metadata for each
> record batch, which are scattered at different offsets in the file.
> Multiple non-contiguous small reads can be costly (e.g., HTTP GET requests
> to read byte ranges from an S3 object).
>
> This problem has been discussed in GitHub issues:
>
> https://github.com/apache/arrow/issues/18250
>
> https://github.com/apache/arrow/issues/24575
>
> To solve this problem, I propose a small backwards-compatible change to
> the file format. We can add a
>
> recordBatchLengths: [long];
>
> field to the Footer table (
> https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
> type of this new field match the length field in the RecordBatch table.
> This new field must be after the custom_metadata field in the Footer table
> to satisfy constraints of FlatBuffers schema evolution. An Arrow file
> whose footer lacks the recordBatchLengths field would be read with a
> default value of null, which indicates that the row counts are not
> present.
>
> What do people think?
>
> Thanks,
> Steve
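For what it's worth, here is a rough sketch of the lookup this would enable on the read side. It's Python pseudocode, not an existing Arrow API: `locate_row` and the `record_batch_lengths` argument (standing in for the proposed footer field) are hypothetical names. The point is that, with per-batch row counts in the footer, mapping a global row index to a batch is a prefix-sum plus binary search with no extra reads:

```python
from bisect import bisect_right
from itertools import accumulate

def locate_row(record_batch_lengths, row_index):
    """Map a global row index to (batch_index, row_within_batch).

    record_batch_lengths: per-batch row counts, as would be read from
    the proposed footer field (hypothetical name, for illustration).
    """
    # starts[i] is the global index of the first row of batch i.
    starts = [0] + list(accumulate(record_batch_lengths))[:-1]
    total = starts[-1] + record_batch_lengths[-1]
    if not 0 <= row_index < total:
        raise IndexError("row index out of range")
    # Binary search for the batch whose range contains row_index.
    batch = bisect_right(starts, row_index) - 1
    return batch, row_index - starts[batch]

# e.g. with batches of 100, 50 and 75 rows, global row 120 is
# row 20 of batch 1:
# locate_row([100, 50, 75], 120) -> (1, 20)
```

A read-range operation for pagination would just apply this to both ends of the range and fetch the resulting span of batches via the byte offsets the footer already stores.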