Currently in C++, I believe the Parquet interface produces a single record batch per read (generally a whole row group or a whole file, with some number of columns selected). In principle, it would be better to generate a sequence of smaller record batches (e.g., ~64K rows each). We support parallelization at the column level, so it would be an interesting benchmarking experiment to measure effective deserialization throughput as a function of batch size. SQL engines generally process record batches asynchronously in a pipeline, so a Parquet scanner prepares the next record batch while the current one is being processed (rather than the processor thread blocking on I/O and deserialization).
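
As a rough illustration of that scanner idea, here is a minimal C++ sketch against the parquet::arrow API as it exists today. FileReader::set_batch_size, set_use_threads, and GetRecordBatchReader postdate this thread, and the function name, the file path parameter, and the 64K batch size are placeholders, not a definitive implementation:

#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Stream a Parquet file as a sequence of small record batches instead of
// one monolithic batch per row group or file.
arrow::Status ScanInBatches(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  reader->set_batch_size(64 * 1024);  // cap each record batch at ~64K rows
  reader->set_use_threads(true);      // column-level parallel decode

  std::unique_ptr<arrow::RecordBatchReader> batches;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&batches));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batches->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Hand `batch` to the downstream operator here; a pipelined engine
    // would run this loop on a scanner thread and queue the batches so
    // consumers never block on I/O or deserialization.
  }
  return arrow::Status::OK();
}

In a real engine the loop body would not process batches inline; the scanner thread would push them onto a bounded queue while the next ReadNext() proceeds, which is exactly the overlap described above.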
I previously opened https://issues.apache.org/jira/browse/ARROW-1012
about creating a stream reader implementation for Parquet

- Wes

On Mon, Jul 24, 2017 at 11:38 AM, Masayuki Takahashi
<masayuki...@gmail.com> wrote:
> Hi Wes,
>
> I understood it, thanks to the explanation. And I will refer to the C++
> implementation.
>
>> but I suspect we will eventually need a "scanner" that yields a
>> sequence of evenly sized record batches (so individual chunks are not
>> too large in memory). Such an interface can be used in an asynchronous
>> data flow setting.
>
> In the current C++ implementation, the size of each RecordBatch will be
> the number of records in the file or the number of records in the
> RowGroup, right?
>
> By making the sizes uniform, will you shorten the execution time when
> running in parallel?
>
> thanks.
>
> 2017-07-24 10:24 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
>> hi Masayuki,
>>
>> I don't have direct experience using Arrow with Parquet in Java, but a
>> common approach is to set a batch size (number of logical rows) and
>> compute a sequence of Arrow record batches converted from the Parquet
>> file.
>>
>> We are only supporting monolithic file and row group reads in C++
>> (https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.h)
>> but I suspect we will eventually need a "scanner" that yields a
>> sequence of evenly sized record batches (so individual chunks are not
>> too large in memory). Such an interface can be used in an asynchronous
>> data flow setting.
>>
>> - Wes
>>
>> On Sun, Jul 23, 2017 at 9:19 AM, Masayuki Takahashi
>> <masayuki...@gmail.com> wrote:
>>> Hi,
>>>
>>> I am trying to convert Parquet files to Arrow.
>>> https://gist.github.com/masayuki038/4be6c8538dfd4563a8d5ff743cf375ae
>>>
>>> And I have a question.
>>>
>>> When converting Parquet to Arrow, is it the right idea to create an
>>> Arrow VectorSchemaRoot for each RowGroup of the Parquet file?
>>>
>>> thanks.
>>>
>>> 2017-07-21 5:19 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
>>>> hi Sven,
>>>>
>>>> There is a placeholder project in apache/parquet-mr:
>>>> https://github.com/apache/parquet-mr/tree/master/parquet-arrow
>>>>
>>>> In the meantime, it appears that Dremio has created a vectorized
>>>> Parquet <-> Arrow reader/writer which has just been open sourced under
>>>> ASL 2.0:
>>>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>>>>
>>>> I am sure they are very busy right now, but it may be worth discussing
>>>> factoring out this Parquet <-> Arrow interface into a library
>>>> component that can be donated to Apache Parquet.
>>>>
>>>> - Wes
>>>>
>>>> On Wed, Jul 19, 2017 at 4:28 PM, Sven Wagner-Boysen
>>>> <sven.wagner-boy...@signavio.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I started looking into the Parquet and Arrow projects. They look
>>>>> very promising to me.
>>>>>
>>>>> I also came across PyArrow and the Parquet-Arrow integration in
>>>>> Python. Is there something similar available for Java?
>>>>>
>>>>> https://arrow.apache.org/docs/python/parquet.html
>>>>>
>>>>> Thanks
>>>>> Sven
>>>
>>>
>>>
>>> --
>>> 高橋 真之
>
>
>
> --
> 高橋 真之
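
For completeness, the per-row-group read pattern Masayuki asks about maps onto the C++ side roughly as follows: one arrow::Table per row group, the analogue of filling one VectorSchemaRoot per RowGroup in Java. This is a hedged sketch, not the canonical answer; the function name and path parameter are invented, and FileReader::ReadRowGroup is the parquet/arrow/reader.h call assumed here:

#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Read each Parquet row group into its own arrow::Table -- roughly the
// C++ analogue of one VectorSchemaRoot per RowGroup in Java.
arrow::Status ReadPerRowGroup(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
    // `table` holds all columns of row group i, so peak memory tracks
    // the row-group size rather than a fixed, tunable batch size --
    // the limitation the proposed evenly sized scanner would address.
  }
  return arrow::Status::OK();
}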