Currently in C++, I believe the Parquet interface produces a single
record batch per read (which is generally a whole row group or a whole
file with some number of columns selected). In principle, it would be
better to generate a sequence of smaller record batches (e.g. with 64K
rows or so). We sup
Hi Wes,
Thanks for the explanation; I understand now. I will refer to the C++
implementation.
> but I suspect we will eventually need a "scanner" that yields a
> sequence of evenly sized record batches (so individual chunks are not
> too large in memory). Such an interface can be used in an asynchr
hi Masayuki,
I don't have direct experience using Arrow with Parquet in Java, but a
common approach is to set a batch size (number of logical rows) and
compute a sequence of Arrow record batches converted from the Parquet
file.
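As a rough, library-agnostic sketch of that idea (this is not the Arrow or
Parquet API; the name iter_record_batches and the plain-dict "batch" are
assumptions made up for illustration), rows from consecutive row groups can
be re-chunked into batches of at most batch_size rows, so batch boundaries
do not depend on row group boundaries:

```python
from itertools import chain, islice


def iter_record_batches(row_groups, batch_size=65536):
    """Yield column-oriented batches of at most `batch_size` rows.

    `row_groups` is a hypothetical stand-in for a Parquet file: an iterable
    of row groups, each an iterable of row dicts sharing the same keys.
    Each yielded batch is a dict mapping column name -> list of values.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Flatten row groups lazily so only one batch is held in memory at a time.
    rows = chain.from_iterable(row_groups)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        # Pivot the row-oriented chunk into columns.
        yield {name: [row[name] for row in chunk] for name in chunk[0]}


if __name__ == "__main__":
    # Two row groups of 5 and 2 rows, re-chunked into batches of 3 rows.
    groups = [[{"id": i} for i in range(5)], [{"id": i} for i in range(5, 7)]]
    for batch in iter_record_batches(groups, batch_size=3):
        print(len(batch["id"]), batch["id"])
```

With a real reader, the pivot step would instead fill Arrow vectors, and
only the final batch would be smaller than batch_size.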
We currently support only monolithic file and row group reads in C++
(ht
Hi,
I am trying to convert Parquet files to Arrow.
https://gist.github.com/masayuki038/4be6c8538dfd4563a8d5ff743cf375ae
And I have a question.
When converting Parquet to Arrow, is it the right approach to create one
Arrow VectorSchemaRoot for each Parquet RowGroup?
Thanks.
2017-07-21 5:19 GMT+09:00 Wes M
Yes, it would be great to have a component/library that can be embedded
in any other product and can perform operations like
aggregation/join/filter/etc. on Arrow datasets.
Do you think it would be really hard to extract this part out of dremio-oss?
Sincerely,
Michael Shtelma
On Sat, Jul 2
We do have relational operators as well in our code. We're trying to figure
out what to contribute back and how to factor it. For now, the code is under
the Apache license, so you are free to use it. Our relational operations are
here:
https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main
Hi Wes,
It is really great that you have open-sourced all this!
As far as I understand, you have also open-sourced the engine that can
execute relational operators on Arrow?
Is it possible to use it as a library?
Are you also planning to donate it to the Arrow project at some point?
Sincerely,
Michael
hi Sven,
There is a placeholder project in apache/parquet-mr
https://github.com/apache/parquet-mr/tree/master/parquet-arrow.
It appears in the meantime that Dremio has created a vectorized
Parquet <-> Arrow reader/writer which has just been open sourced under
ASL 2.0:
https://github.com/dremio/d
Hi,
I started looking into the Parquet and Arrow projects. They look very
promising to me.
I also came across PyArrow and the Parquet-Arrow integration in Python. Is
there something similar available for Java?
https://arrow.apache.org/docs/python/parquet.html
Thanks
Sven