Hi Michael,

I think ArrowFileReader takes a SeekableByteChannel, so it should be
possible to read only the metadata for each record batch and skip the data.
However, this is not implemented.

If the input channel is not seekable (for example, a socket channel), then
you would need to read the body of each record batch to get to the next
batch, so my hunch is that the performance will be similar whether you read
the record batch body into a VectorSchemaRoot or just read the bytes.

If you can't assume your input data is always going to be seekable, I am
not sure there is a quicker way to do this.
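
To illustrate the seekable case, here is a minimal sketch of the idea. It
does not use the Arrow API, and the framing is a made-up toy format (a row
count, a body length, then the body), not the Arrow file layout. The class
and method names (SkipBodies, writeBlock, countRows) are hypothetical. It
only shows that with a SeekableByteChannel you can read each small header and
then jump over the body by moving the channel position, which a socket
channel cannot do.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SkipBodies {
    // Toy framing, NOT the Arrow IPC format. Each block is:
    // [int rowCount][int bodyLength][bodyLength bytes of body]

    // Appends one block at the channel's current position.
    static void writeBlock(SeekableByteChannel ch, int rowCount, byte[] body)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8 + body.length);
        buf.putInt(rowCount).putInt(body.length).put(body).flip();
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }

    // Sums the row counts by reading only the 8-byte headers.
    static long countRows(SeekableByteChannel ch) throws IOException {
        long total = 0;
        ByteBuffer header = ByteBuffer.allocate(8);
        ch.position(0);
        while (ch.position() < ch.size()) {
            header.clear();
            while (header.hasRemaining()) {
                if (ch.read(header) < 0) {
                    throw new IOException("truncated header");
                }
            }
            header.flip();
            total += header.getInt();
            int bodyLength = header.getInt();
            // Seek past the body instead of reading it.
            ch.position(ch.position() + bodyLength);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("blocks", ".bin");
        try (SeekableByteChannel ch = Files.newByteChannel(
                tmp, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            writeBlock(ch, 3, new byte[1024]);
            writeBlock(ch, 4, new byte[2048]);
            System.out.println(countRows(ch));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With a non-seekable channel, the position change in countRows would have to
become a read of bodyLength bytes, which is why I expect little difference
in that case.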



On Fri, Sep 21, 2018 at 9:33 AM Michael Knopf <mkn...@rapidminer.com> wrote:

> Hi all,
>
> I am looking for a quick way to look up the total row count of a data set
> stored in Arrow’s random access file format using the Java API. Basically,
> a quicker way to do this:
>
> // The reader is an instance of ArrowFileReader
> VectorSchemaRoot root = reader.getVectorSchemaRoot();
> List<ArrowBlock> blocks = reader.getRecordBlocks();
> int nRows = 0;
> for (ArrowBlock block : blocks) {
>     // Loads the full batch (metadata and body) into the root
>     reader.loadRecordBatch(block);
>     nRows += root.getRowCount();
> }
>
> My understanding is that the above snippet loads the entire data set
> instead of just the block headers.
>
> To give you some context, I am looking into using Arrow for IPC between a
> JVM and a Python interpreter using a custom data format and PyArrow/Pandas
> respectively. While the streaming API might be a better tool for this job,
> I started out with using files to keep things simple.
>
> Any help would be greatly appreciated – maybe I just missed the right bit
> of documentation.
>
> Thanks,
> Michael
