Hi Wes,

Thanks for the explanation; I understand now. I will refer to the C++
implementation.

> but I suspect we will eventually need a "scanner" that yields a
> sequence of evenly sized record batches (so individual chunks are not
> too large in memory). Such an interface can be used in an asynchronous
> data flow setting.

In the current C++ implementation, the size of a RecordBatch is either the
number of records in the file or the number of records in a RowGroup, right?

Would making the batch size uniform shorten the execution time when the
batches are processed in parallel?
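
To make the two batching strategies concrete, here is a minimal sketch in
plain Python. It does not use the parquet-cpp or Arrow API: the `rebatch`
helper and the row-group data are hypothetical, and rows are stand-in
integers. It only contrasts one batch per RowGroup with batches of a fixed
number of rows that ignore RowGroup boundaries.

```python
from typing import Iterator, List, Sequence


def rebatch(row_groups: Sequence[Sequence[int]], batch_size: int) -> Iterator[List[int]]:
    """Yield lists of at most batch_size rows, ignoring row-group boundaries.

    Hypothetical helper for illustration; not part of any Arrow or Parquet API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pending: List[int] = []
    for group in row_groups:
        for row in group:
            pending.append(row)
            if len(pending) == batch_size:
                yield pending
                pending = []
    if pending:
        yield pending  # the final batch may be shorter than batch_size


# Hypothetical file with three row groups of 5, 2, and 4 rows.
row_groups = [list(range(0, 5)), list(range(5, 7)), list(range(7, 11))]

# One batch per row group: batch sizes follow the file layout.
per_group_sizes = [len(group) for group in row_groups]

# Fixed-size batches: every batch except possibly the last has 4 rows.
even_sizes = [len(batch) for batch in rebatch(row_groups, batch_size=4)]
```

With these inputs the per-RowGroup sizes follow the file layout, while the
fixed-size batches are bounded by `batch_size` regardless of how the file
was written. Whether that helps parallel execution time would depend on the
workload; the sketch only shows the bounded batch size.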

thanks.

2017-07-24 10:24 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
> hi Masayuki,
>
> I don't have direct experience using Arrow with Parquet in Java, but a
> common approach is to set a batch size (number of logical rows) and
> compute a sequence of Arrow record batches converted from the Parquet
> file.
>
> We are only supporting monolithic file and row group reads in C++
> (https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.h)
> but I suspect we will eventually need a "scanner" that yields a
> sequence of evenly sized record batches (so individual chunks are not
> too large in memory). Such an interface can be used in an asynchronous
> data flow setting.
>
> - Wes
>
> On Sun, Jul 23, 2017 at 9:19 AM, Masayuki Takahashi
> <masayuki...@gmail.com> wrote:
>> Hi,
>>
>> I try to convert Parquet files to Arrow.
>> https://gist.github.com/masayuki038/4be6c8538dfd4563a8d5ff743cf375ae
>>
>> And I have a question.
>>
>> When converting Parquet to Arrow, is it the right idea to make Arrow's
>> VectorSchemaRoot for each RowGroup of Parquet?
>>
>> thanks.
>>
>> 2017-07-21 5:19 GMT+09:00 Wes McKinney <wesmck...@gmail.com>:
>>> hi Sven,
>>>
>>> There is a placeholder project in apache/parquet-mr
>>> https://github.com/apache/parquet-mr/tree/master/parquet-arrow.
>>>
>>> It appears in the meantime that Dremio has created a vectorized
>>> Parquet <-> Arrow reader/writer which has just been open sourced under
>>> ASL 2.0: 
>>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>>>
>>> I am sure they are very busy right now, but it may be worth discussing
>>> factoring out this Parquet <-> Arrow interface into a library
>>> component that can be donated to Apache Parquet.
>>>
>>> - Wes
>>>
>>> On Wed, Jul 19, 2017 at 4:28 PM, Sven Wagner-Boysen
>>> <sven.wagner-boy...@signavio.com> wrote:
>>>> Hi,
>>>>
>>>> I started looking into the projects Parquet and Arrow. Looks very promising
>>>> to me.
>>>>
>>>> I also came across PyArrow and the Parquet-Arrow integration in Python. Is
>>>> there something similar available for Java?
>>>>
>>>> https://arrow.apache.org/docs/python/parquet.html
>>>>
>>>> Thanks
>>>> Sven
>>
>>
>>
>> --
>> 高橋 真之



-- 
高橋 真之
