Currently in C++, I believe the Parquet interface produces a single record batch per read (generally a whole row group or a whole file, with some number of columns selected). In principle, it would be better to generate a sequence of smaller record batches (e.g., ~64K rows each). We support parallelization at the column level, so it would be an interesting benchmarking experiment to measure effective deserialization throughput as a function of batch size. SQL engines generally process record batches asynchronously in a pipeline, so a Parquet scanner prepares the next record batch while the current one is being processed (rather than the processor thread blocking on I/O and deserialization).
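
As a rough illustration of that scanner idea, here is a minimal C++ sketch against the parquet::arrow API as it exists today. FileReader::set_batch_size, set_use_threads, and GetRecordBatchReader postdate this thread, and the function name, the file path parameter, and the 64K batch size are placeholders, not a definitive implementation:

#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Stream a Parquet file as a sequence of small record batches instead of
// one monolithic batch per row group or file.
arrow::Status ScanInBatches(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  reader->set_batch_size(64 * 1024);  // cap each record batch at ~64K rows
  reader->set_use_threads(true);      // column-level parallel decode

  std::unique_ptr<arrow::RecordBatchReader> batches;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&batches));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batches->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Hand `batch` to the downstream operator here; a pipelined engine
    // would run this loop on a scanner thread and queue the batches so
    // consumers never block on I/O or deserialization.
  }
  return arrow::Status::OK();
}

In a real engine the loop body would not process batches inline; the scanner thread would push them onto a bounded queue while the next ReadNext() proceeds, which is exactly the overlap described above.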
I previously opened https://issues.apache.org/jira/browse/ARROW-1012
about creating a stream reader implementation for Parquet

- Wes

On Mon, Jul 24, 2017 at 11:38 AM, Masayuki Takahashi
<masayuki...@gmail.com> wrote:
> Hi Wes,
>
> I understood it, thanks to the explanation. And I will refer to the C++
> implementation.
>
>> but I suspect we will eventually need a "scanner" that yields a
>> sequence of evenly sized record batches (so individual chunks are not
>> too large in memory). Such an interface can be used in an asynchronous
>> data flow setting.
>
> In the current C++ implementation, the size of each RecordBatch will be
> the number of records in the file or the number of records in the
> RowGroup, right?
>
> By making the sizes uniform, will you shorten the execution time when
> running in parallel?
>
> thanks.
>
> 2017-07-24 10:24 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
>> hi Masayuki,
>>
>> I don't have direct experience using Arrow with Parquet in Java, but a
>> common approach is to set a batch size (number of logical rows) and
>> compute a sequence of Arrow record batches converted from the Parquet
>> file.
>>
>> We are only supporting monolithic file and row group reads in C++
>> (https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.h)
>> but I suspect we will eventually need a "scanner" that yields a
>> sequence of evenly sized record batches (so individual chunks are not
>> too large in memory). Such an interface can be used in an asynchronous
>> data flow setting.
>>
>> - Wes
>>
>> On Sun, Jul 23, 2017 at 9:19 AM, Masayuki Takahashi
>> <masayuki...@gmail.com> wrote:
>>> Hi,
>>>
>>> I am trying to convert Parquet files to Arrow.
>>> https://gist.github.com/masayuki038/4be6c8538dfd4563a8d5ff743cf375ae
>>>
>>> And I have a question.
>>>
>>> When converting Parquet to Arrow, is it the right idea to create an
>>> Arrow VectorSchemaRoot for each RowGroup of the Parquet file?
>>>
>>> thanks.
>>>
>>> 2017-07-21 5:19 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
>>>> hi Sven,
>>>>
>>>> There is a placeholder project in apache/parquet-mr:
>>>> https://github.com/apache/parquet-mr/tree/master/parquet-arrow
>>>>
>>>> In the meantime, it appears that Dremio has created a vectorized
>>>> Parquet <-> Arrow reader/writer which has just been open sourced under
>>>> ASL 2.0:
>>>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>>>>
>>>> I am sure they are very busy right now, but it may be worth discussing
>>>> factoring out this Parquet <-> Arrow interface into a library
>>>> component that can be donated to Apache Parquet.
>>>>
>>>> - Wes
>>>>
>>>> On Wed, Jul 19, 2017 at 4:28 PM, Sven Wagner-Boysen
>>>> <sven.wagner-boy...@signavio.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I started looking into the Parquet and Arrow projects. They look
>>>>> very promising to me.
>>>>>
>>>>> I also came across PyArrow and the Parquet-Arrow integration in
>>>>> Python. Is there something similar available for Java?
>>>>>
>>>>> https://arrow.apache.org/docs/python/parquet.html
>>>>>
>>>>> Thanks
>>>>> Sven
>>>
>>>
>>>
>>> --
>>> 高橋 真之
>
>
>
> --
> 高橋 真之
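
For completeness, the per-row-group read pattern Masayuki asks about maps onto the C++ side roughly as follows: one arrow::Table per row group, the analogue of filling one VectorSchemaRoot per RowGroup in Java. This is a hedged sketch, not the canonical answer; the function name and path parameter are invented, and FileReader::ReadRowGroup is the parquet/arrow/reader.h call assumed here:

#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Read each Parquet row group into its own arrow::Table -- roughly the
// C++ analogue of one VectorSchemaRoot per RowGroup in Java.
arrow::Status ReadPerRowGroup(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
    // `table` holds all columns of row group i, so peak memory tracks
    // the row-group size rather than a fixed, tunable batch size --
    // the limitation the proposed evenly sized scanner would address.
  }
  return arrow::Status::OK();
}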