+1 from me, we would definitely like this feature.

A name like recordBatchNumRows or recordBatchRowCounts would make it clear 
that the field counts rows, not bytes. The RecordBatchStatistics idea would 
also be fine for us, although we have no immediate need for other statistics.

Our use case is that we need to retrieve ranges of rows from large datasets 
for both processing and display (i.e. pagination). For large-ish data stored 
in cloud buckets or HDFS, reading the metadata for each batch is not really a 
performant option. Our current workaround is to use a constant batch size and 
store that size in the footer custom metadata, for datasets created / owned 
by our platform. Having this properly in the Arrow format would obviously be 
much better, and would also allow variable-size batches, e.g. if data arrives 
in a series of deltas and we don't want to re-batch it.
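To illustrate, our constant-batch-size workaround boils down to simple 
arithmetic. This is a hypothetical sketch (the helper name and parameters are 
ours, not part of any Arrow API); batch_size is the constant we currently 
stash in the footer custom metadata:

```python
def locate_row(row_index: int, batch_size: int, num_batches: int):
    """Map a global row index to (batch index, offset within that batch),
    assuming every batch holds exactly batch_size rows (a simplification
    that only holds for datasets our platform wrote itself)."""
    if row_index < 0:
        raise IndexError(f"negative row index {row_index}")
    batch = row_index // batch_size
    if batch >= num_batches:
        raise IndexError(f"row {row_index} is beyond the last batch")
    return batch, row_index % batch_size
```

The limitation is obvious: this breaks as soon as batches vary in size, which 
is exactly what per-batch row counts in the footer would fix.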

Just adding this information to the file format, plus the write support in 
the various language implementations, is enough for us. A second step would 
be to add a read-range operation to the language APIs - I suspect that would 
make the feature much more usable for a lot of people, but it's not essential 
for our particular project, since we already intercept the loading mechanism 
to get non-blocking behaviour. From a testability point of view it might 
still make sense to do both parts together, though!
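For concreteness, here is roughly what a read-range lookup could do once 
per-batch row counts are available from the footer. This is a hedged sketch, 
not an Arrow API: the function name is ours, and record_batch_lengths stands 
in for the proposed footer field:

```python
import bisect
from itertools import accumulate

def batch_for_row(record_batch_lengths: list[int], row_index: int):
    """Given per-batch row counts (as the proposed footer field would
    provide), return (batch index, offset within that batch) for a global
    row index, using a binary search over cumulative row counts."""
    ends = list(accumulate(record_batch_lengths))  # cumulative totals
    if not ends or row_index < 0 or row_index >= ends[-1]:
        raise IndexError(f"row {row_index} out of range")
    # First batch whose cumulative end exceeds row_index.
    batch = bisect.bisect_right(ends, row_index)
    start = ends[batch - 1] if batch > 0 else 0
    return batch, row_index - start
```

The point is that everything needed for the lookup lives in one contiguous 
footer read, so a seek costs one ranged GET for the footer plus one for the 
target batch, instead of one small read per batch.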

On 2023/03/19 04:39:15 Steve Kim wrote:
> Hello everyone,
> 
> I would like to be able to quickly seek to an arbitrary row in an Arrow
> file.
> 
> With the current file format, reading the file footer alone is not enough to
> determine the record batch that contains a given row index. The row counts
> of the record batches are only found in the metadata for each record batch,
> which are scattered at different offsets in the file. Multiple
> non-contiguous small reads can be costly (e.g., HTTP GET requests to read
> byte ranges from an S3 object).
> 
> This problem has been discussed in GitHub issues:
> 
> https://github.com/apache/arrow/issues/18250
> 
> https://github.com/apache/arrow/issues/24575
> 
> To solve this problem, I propose a small backwards-compatible change to the
> file format. We can add a
> 
>     recordBatchLengths: [long];
> 
> field to the Footer table (
> https://github.com/apache/arrow/blob/main/format/File.fbs). The name and
> type of this new field match the length field in the RecordBatch table.
> This new field must be after the custom_metadata field in the Footer table
> to satisfy constraints of FlatBuffers schema evolution. An Arrow file whose
> footer lacks the recordBatchLengths field would be read with a default
> value of null, which indicates that the row counts are not present.
> 
> What do people think?
> 
> Thanks,
> Steve
> 
