RE: Reading data from Iceberg table into Apache Arrow in Java

Mayur Srivastava Fri, 12 Feb 2021 16:40:58 -0800

Thank you Ryan.

I’ll dig into the file scan plan and the Spark codebase to learn about the 
internals of the Iceberg vectorized read path. Then I’ll try to implement the 
vectorized reader using core components only. I’ll be happy to work with you to 
contribute it back upstream. I’ll get back to you if I have any questions or 
need more pointers.

Thanks,
Mayur

From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Friday, February 12, 2021 2:26 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Reading data from Iceberg table into Apache Arrow in Java

Hi Mayur,

We built the Arrow support with Spark as the first use case, so the best 
examples of how to use it are in Spark.

The generic reader does two things: it plans a scan and sets up an iterator of 
file readers to produce generic records. What you want to do is the same thing, 
but set up the file readers to produce Arrow batches. You can do that by 
changing the `Parquet.read` call and passing a callback that creates an Arrow 
batch reader rather than a generic row reader. I don't think there is a public 
example of this, but maybe someone else knows about one. A reader like this 
isn't available in Iceberg yet, but if you want to add it we'd be happy to help 
you get it in.
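
The structure described above (plan the scan once, then plug in a different 
file reader) can be sketched in plain Java. This is only an illustration under 
assumed names: `PlannedScan`, `ReaderCallbacks`, and the in-memory "files" are 
hypothetical stand-ins, not Iceberg or Arrow classes. In Iceberg itself the 
corresponding hook is the reader callback passed to `Parquet.read`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical stand-in for a scan: walks planned files and delegates reading to a callback. */
final class PlannedScan<T> implements Iterable<T> {
    private final List<String> plannedFiles;              // stands in for planned file scan tasks
    private final Function<String, Iterator<T>> openFile; // the pluggable reader callback

    PlannedScan(List<String> plannedFiles, Function<String, Iterator<T>> openFile) {
        this.plannedFiles = plannedFiles;
        this.openFile = openFile;
    }

    @Override
    public Iterator<T> iterator() {
        final Iterator<String> files = plannedFiles.iterator();
        return new Iterator<T>() {
            private Iterator<T> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Open the next planned file lazily once the current one is exhausted.
                while (!current.hasNext() && files.hasNext()) {
                    current = openFile.apply(files.next());
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}

/** Two hypothetical callbacks over the same fake files: row-at-a-time and batch-at-a-time. */
final class ReaderCallbacks {
    // Row-style callback: yields one boxed value per row.
    static Function<String, Iterator<Long>> rowReader(Map<String, long[]> files) {
        return path -> Arrays.stream(files.get(path)).boxed().iterator();
    }

    // Batch-style callback: yields one primitive array (a "batch") per file.
    static Function<String, Iterator<long[]>> batchReader(Map<String, long[]> files) {
        return path -> Collections.singletonList(files.get(path)).iterator();
    }
}
```

The planning and iteration code is identical in both cases; only the callback, 
and therefore the element type `T`, changes. That is the sense in which the 
batch reader is "the same thing" as the generic reader.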

The Spark read path has a good example, but it also wraps the Arrow batches so 
Spark can read them. Also, keep in mind that the Arrow integration only 
supports flat schemas right now, not fully nested schemas, so you'd still need 
to fall back to the row-based path for nested data. (Side note: if you have 
code to convert generics to Arrow, it would be really useful to post it 
somewhere.)

I hope that helps. It would be great to work with you to improve this in a 
couple of PRs!

rb

On Thu, Feb 11, 2021 at 7:22 AM Mayur Srivastava 
<mayur.srivast...@twosigma.com> wrote:

Hi,

We have an existing time series data access service based on Arrow/Flight. It 
uses Apache Arrow format data for writes and reads (time range queries) against 
a bespoke table backend built on S3-compatible storage.

We are trying to replace our bespoke table-backend with Iceberg tables. For 
integrating with Iceberg, we are using Iceberg core+data+parquet modules 
directly to write and read data. I would like to note that our service cannot 
use the Spark route to write or read the data. In our current Iceberg reader 
integration code, we are using 
IcebergGenerics.read(table).select(...).where(...).build() to iterate through 
the data row-by-row. Instead of this (potentially slower) read path which needs 
conversion between rows and Arrow VectorSchemaRoot, we want to use a vectorized 
read path which directly returns an Arrow VectorSchemaRoot as a callback or 
Arrow record batches as the result set.
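
To make the conversion cost concrete, here is a minimal plain-Java sketch of 
pivoting a row iterator into fixed-size columnar batches. It is an illustration 
only: `Row`, `ColumnBatch`, and `RowToColumnPivot` are hypothetical names, the 
two-column schema (timestamp, value) is assumed, and primitive arrays stand in 
for Arrow vectors. No Iceberg or Arrow API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for one generic record with an assumed (timestamp, value) schema. */
final class Row {
    final long timestamp;
    final double value;

    Row(long timestamp, double value) {
        this.timestamp = timestamp;
        this.value = value;
    }
}

/** Hypothetical stand-in for a columnar batch: one primitive array per column. */
final class ColumnBatch {
    final long[] timestamps;
    final double[] values;
    final int rowCount;

    ColumnBatch(long[] timestamps, double[] values, int rowCount) {
        this.timestamps = timestamps;
        this.values = values;
        this.rowCount = rowCount;
    }
}

final class RowToColumnPivot {
    /** Copies rows field by field into column arrays, emitting a batch every batchSize rows. */
    static List<ColumnBatch> pivot(Iterator<Row> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<ColumnBatch> batches = new ArrayList<>();
        long[] timestamps = new long[batchSize];
        double[] values = new double[batchSize];
        int count = 0;
        while (rows.hasNext()) {
            Row row = rows.next();
            // Every field of every row is touched once: this is the per-row overhead.
            timestamps[count] = row.timestamp;
            values[count] = row.value;
            count++;
            if (count == batchSize) {
                batches.add(new ColumnBatch(timestamps, values, count));
                timestamps = new long[batchSize];
                values = new double[batchSize];
                count = 0;
            }
        }
        if (count > 0) {
            // Trim the final, partially filled batch.
            batches.add(new ColumnBatch(
                Arrays.copyOf(timestamps, count), Arrays.copyOf(values, count), count));
        }
        return batches;
    }
}
```

A vectorized read path avoids this step because the file reader fills the 
column buffers directly instead of materializing a row object first.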

I have noticed that Iceberg already has an Arrow module: 
https://github.com/apache/iceberg/tree/master/arrow/src/main/java/org/apache/iceberg/arrow
I have also looked into https://github.com/apache/iceberg/issues/9 and 
https://github.com/apache/iceberg/milestone/2. However, I’m not sure about the 
current status of the vectorized reader support. I’m also not sure how this 
Arrow module is used to perform a vectorized read when executing a query on an 
Iceberg table with the core/data/parquet libraries.

I have a few questions regarding the Vectorized reader/Arrow support:

1. Is it possible to run a vectorized read on an Iceberg table to return 
data in Arrow format using a non-Spark reader in Java?

2. Is there an example of reading data in Arrow format from an Iceberg 
table?

3. Is the Spark read path completely vectorized? I ask to find out whether we 
can borrow from the vectorized Spark reader, or move code from the vectorized 
Spark reader into the Iceberg core library.

Let me know if you have any questions for me.

Thanks,

Mayur

--
Ryan Blue
Software Engineer
Netflix
