Re: Parquet+Arrow Java

2017-07-24 Thread Wes McKinney
Currently in C++, I believe the Parquet interface produces a single record batch per read (which is generally a whole row group or a whole file with some number of columns selected). In principle, it would be better to generate a sequence of smaller record batches (e.g. with 64K rows or so). We sup

Re: Parquet+Arrow Java

2017-07-24 Thread Masayuki Takahashi
Hi Wes, Thanks to your explanation, I understand now. I will refer to the C++ implementation. > but I suspect we will eventually need a "scanner" that yields a > sequence of evenly sized record batches (so individual chunks are not > too large in memory). Such an interface can be used in an asynchr
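
The "scanner" idea quoted above can be illustrated with a minimal, library-free Java sketch. It assumes rows arrive through a plain `java.util.Iterator`; the class name `BatchScanner` is hypothetical and is not an Arrow or parquet-mr API. A real implementation would presumably fill Arrow vectors rather than Java lists.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: not part of Arrow or parquet-mr.
// Lazily groups rows from an underlying iterator into batches of at most
// batchSize rows, so that only one batch is materialized at a time.
final class BatchScanner<T> implements Iterator<List<T>> {
    private final Iterator<T> rows;
    private final int batchSize;

    BatchScanner(Iterator<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rows = rows;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        // Another batch exists as long as at least one row remains.
        return rows.hasNext();
    }

    @Override
    public List<T> next() {
        if (!rows.hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> batch = new ArrayList<>();
        // Pull rows until the batch is full or the source is exhausted.
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }
}
```

Every batch except possibly the last holds exactly `batchSize` rows, and a consumer (synchronous or asynchronous) only ever holds one batch in memory at a time.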

Re: Parquet+Arrow Java

2017-07-23 Thread Wes McKinney
hi Masayuki, I don't have direct experience using Arrow with Parquet in Java, but a common approach is to set a batch size (number of logical rows) and compute a sequence of Arrow record batches converted from the Parquet file. We are only supporting monolithic file and row group reads in C++ (ht
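
As a rough sketch of the batch-size approach described above, the following Java snippet splits a known row count into consecutive row ranges, each of which could then be converted into one Arrow record batch. It deliberately avoids any Arrow or Parquet API; `BatchPlanner` and `RowRange` are hypothetical names introduced only for illustration, and the actual Parquet-to-Arrow conversion is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Arrow or parquet-mr.
final class BatchPlanner {

    /** Half-open row range [start, start + length) within a file or row group. */
    static final class RowRange {
        final long start;
        final int length;

        RowRange(long start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Splits totalRows logical rows into consecutive ranges of at most
     * batchSize rows each. Only the final range may be shorter.
     */
    static List<RowRange> plan(long totalRows, int batchSize) {
        if (totalRows < 0) {
            throw new IllegalArgumentException("totalRows must be non-negative");
        }
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<RowRange> ranges = new ArrayList<>();
        for (long start = 0; start < totalRows; start += batchSize) {
            int length = (int) Math.min(batchSize, totalRows - start);
            ranges.add(new RowRange(start, length));
        }
        return ranges;
    }
}
```

For example, `BatchPlanner.plan(150000, 65536)` yields three ranges of 65536, 65536, and 18928 rows.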

Re: Parquet+Arrow Java

2017-07-23 Thread Masayuki Takahashi
Hi, I am trying to convert Parquet files to Arrow. https://gist.github.com/masayuki038/4be6c8538dfd4563a8d5ff743cf375ae And I have a question. When converting Parquet to Arrow, is it the right approach to create one Arrow VectorSchemaRoot for each Parquet RowGroup? Thanks. 2017-07-21 5:19 GMT+09:00 Wes M

Re: Parquet+Arrow Java

2017-07-23 Thread Michael Shtelma
Yes, it would be great to have a component/library that can be embedded in any other product and can perform operations like aggregation/join/filter/etc. on Arrow datasets. Do you think it would be really hard to extract this part out of dremio-oss? Sincerely, Michael Shtelma On Sat, Jul 2

Re: Parquet+Arrow Java

2017-07-21 Thread Jacques Nadeau
We do have relational operators as well in our code. We're trying to figure out what to contribute back and how to factor it. For now, the code is under the Apache license, so you are free to use it. Our relational operations are here: https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main

Re: Parquet+Arrow Java

2017-07-21 Thread Michael Shtelma
Hi Wes, It is really great that you have open-sourced all this! As far as I understand, you have also open-sourced the engine that can execute relational operators on Arrow? Is it possible to use it as a library? Are you also planning to donate it to the Arrow project at some point? Sincerely, Michael

Re: Parquet+Arrow Java

2017-07-20 Thread Wes McKinney
hi Sven, There is a placeholder project in apache/parquet-mr: https://github.com/apache/parquet-mr/tree/master/parquet-arrow . In the meantime, it appears that Dremio has created a vectorized Parquet <-> Arrow reader/writer, which has just been open sourced under ASL 2.0: https://github.com/dremio/d

Parquet+Arrow Java

2017-07-19 Thread Sven Wagner-Boysen
Hi, I started looking into the projects Parquet and Arrow. Looks very promising to me. I also came across PyArrow and the Parquet-Arrow integration in Python. Is there something similar available for Java? https://arrow.apache.org/docs/python/parquet.html Thanks Sven