hi Masayuki,

I don't have direct experience using Arrow with Parquet in Java, but a common approach is to choose a batch size (a number of logical rows) and convert the Parquet file into a sequence of Arrow record batches, each holding at most that many rows.
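To make the batch-size idea concrete, here is a minimal sketch in plain Java. It deliberately uses no Arrow or Parquet API: `BatchPlanner`, `Slice`, and `plan` are hypothetical names invented for illustration, and the sketch assumes the row count of each row group is already known (for example, from the file metadata). It only works out which rows of which row group would go into each batch; actually decoding the columns into vectors is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans fixed-size batches over a file's row groups. */
final class BatchPlanner {

    /** A contiguous run of rows taken from a single row group. */
    static final class Slice {
        final int rowGroup;  // index of the row group the rows come from
        final long offset;   // first row of the slice within that row group
        final long length;   // number of rows in the slice

        Slice(int rowGroup, long offset, long length) {
            this.rowGroup = rowGroup;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Splits all rows into batches of exactly batchSize rows, except possibly
     * the last one. A batch may span several row groups, so each batch is
     * described by a list of slices.
     */
    static List<List<Slice>> plan(long[] rowGroupSizes, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Slice>> batches = new ArrayList<>();
        List<Slice> current = new ArrayList<>();
        long room = batchSize;  // rows that still fit into the current batch

        for (int g = 0; g < rowGroupSizes.length; g++) {
            long offset = 0;
            long remaining = rowGroupSizes[g];
            while (remaining > 0) {
                long take = Math.min(remaining, room);
                current.add(new Slice(g, offset, take));
                offset += take;
                remaining -= take;
                room -= take;
                if (room == 0) {
                    // The current batch is full: emit it and start a new one.
                    batches.add(current);
                    current = new ArrayList<>();
                    room = batchSize;
                }
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);  // final, possibly smaller, batch
        }
        return batches;
    }
}
```

For row groups of 5, 3, and 4 rows and a batch size of 4, `BatchPlanner.plan(new long[]{5, 3, 4}, 4)` gives three batches of four rows each, the second of which spans row groups 0 and 1. Making one batch per row group instead is the special case where no slice crosses a row group boundary; that is simpler, but the batch sizes are then dictated by whoever wrote the file.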
We are only supporting monolithic file and row group reads in C++
(https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.h)
but I suspect we will eventually need a "scanner" that yields a sequence of
evenly sized record batches (so individual chunks are not too large in
memory). Such an interface can be used in an asynchronous data flow setting.

- Wes

On Sun, Jul 23, 2017 at 9:19 AM, Masayuki Takahashi <masayuki...@gmail.com> wrote:
> Hi,
>
> I try to convert Parquet files to Arrow.
> https://gist.github.com/masayuki038/4be6c8538dfd4563a8d5ff743cf375ae
>
> And I have a question.
>
> When converting Parquet to Arrow, is it the right idea to make Arrow's
> VectorSchemaRoot for each RowGroup of Parquet?
>
> thanks.
>
> 2017-07-21 5:19 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
>> hi Sven,
>>
>> There is a placeholder project in apache/parquet-mr
>> https://github.com/apache/parquet-mr/tree/master/parquet-arrow.
>>
>> It appears in the meantime that Dremio has created a vectorized
>> Parquet <-> Arrow reader/writer which has just been open sourced under
>> ASL 2.0:
>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>>
>> I am sure they are very busy right now, but it may be worth discussing
>> factoring out this Parquet <-> Arrow interface into a library
>> component that can be donated to Apache Parquet.
>>
>> - Wes
>>
>> On Wed, Jul 19, 2017 at 4:28 PM, Sven Wagner-Boysen
>> <sven.wagner-boy...@signavio.com> wrote:
>>> Hi,
>>>
>>> I started looking into the projects Parquet and Arrow. Looks very promising
>>> to me.
>>>
>>> I also came across PyArrow and the Parquet-Arrow integration in Python. Is
>>> there something similar available for Java?
>>>
>>> https://arrow.apache.org/docs/python/parquet.html
>>>
>>> Thanks
>>> Sven
>
>
>
> --
> 高橋 真之