Re: [Parquet support]

2021-05-21 Thread Etienne Chauchot
Thanks Arvid ! Etienne On 21/05/2021 12:45, Arvid Heise wrote: Hi Etienne, I'm taking over and just left you a review. Sorry for the long delays. Best, Arvid On Fri, May 21, 2021 at 11:25 AM Etienne Chauchot wrote: Hi all, considering (see my email below) that the DataStream API is not

Re: [Parquet support]

2021-05-21 Thread Arvid Heise
Hi Etienne, I'm taking over and just left you a review. Sorry for the long delays. Best, Arvid On Fri, May 21, 2021 at 11:25 AM Etienne Chauchot wrote: Hi all, considering (see my email below) that the DataStream API is not fully functional yet (batch mode) and considering that the ne

Re: [Parquet support]

2021-05-21 Thread Etienne Chauchot
Hi all, considering (see my email below) that the DataStream API is not fully functional yet (batch mode) and considering that the new sources are only available on the DataStream API, can we merge this PR [1] about existing sources in the DataSet API? It has already received one LGTM. [1] https://

Re: [Parquet support]

2021-05-06 Thread Etienne Chauchot
Hi, @Jingsong, I agree that adding a new feature (ParquetAvroInputFormat) to an old source API is a maintenance burden. But IMHO, while the new convergent DataStream batch/streaming API is not 100% functional, we still need to maintain older sources and add missing features to them.

Re: [Parquet support]

2021-03-16 Thread Etienne Chauchot
Hi all, The fix (https://issues.apache.org/jira/browse/FLINK-21388) is now also available for Flink 1.12 (thanks Jingsong for merging the cherry-pick PR). But before releasing the 1.12 branch, I'd like this other PR to be merged: https://github.com/apache/flink/pull/15156 that introduces Par

Re: [Parquet support]

2021-03-12 Thread Etienne Chauchot
Hi Jingsong, I just submitted a cherry-pick PR https://github.com/apache/flink/pull/15172 of [1] to the release-1.12 branch. [1] https://github.com/apache/flink/pull/14961 Etienne On 12/03/2021 14:55, Etienne Chauchot wrote: Hi Jingsong, No problem for the delay. Thanks for merging the first p

Re: [Parquet support]

2021-03-12 Thread Etienne Chauchot
Hi Jingsong, No problem for the delay. Thanks for merging the first parquet PR. I'll submit the 2 PRs to 1.12 when they're all merged to master. For that, I just have to submit a PR against this branch: https://github.com/apache/flink/tree/release-1.12 ? Best, Etienne On 12/03/2021 03:56,

Re: [Parquet support]

2021-03-11 Thread Jingsong Li
Hi Etienne, Sorry for the late reply, I just merged your bug fix. I think you can submit a PR for release-1.12. Best, Jingsong On Fri, Mar 12, 2021 at 12:22 AM Etienne Chauchot wrote: Hi, I forgot to mention that I submitted the new ParquetAvroInputFormat to master (1.13) but it is

Re: [Parquet support]

2021-03-11 Thread Etienne Chauchot
Hi, I forgot to mention that I submitted the new ParquetAvroInputFormat to master (1.13), but it is made to work for 1.12.x (last release) as well, and I'm using it with Flink 1.12.x. Maybe it could be a good candidate for inclusion in an upcoming 1.12.3 release, WDYT? Best Etienne On 11/03

Re: [Parquet support]

2021-03-11 Thread Etienne Chauchot
Hi all, I just submitted another Parquet PR that adds ParquetAvroInputFormat (I'm using it in a benchmark I'm coding). If anyone is interested in reviewing it, be my guest: https://github.com/apache/flink/pull/15156 I also have an older Parquet PR that fixes a format conversion bug that is

Re: [Parquet support]

2021-02-26 Thread Etienne Chauchot
Hi all, Jingsong, thanks, it makes sense. Besides, sorry, but I found another bug in ParquetInputFormat: https://issues.apache.org/jira/browse/FLINK-21520 For my urgent needs, I'll work around it by filtering the DataSet rather than applying the filter in the ParquetInputFormat at source reading ti

Re: [Parquet support]

2021-02-24 Thread Jingsong Li
Hi Etienne, ParquetColumnarRowInputFormat is not fully functional yet: it has good performance, but it is hard for it to support complex types, like array and map... So I think a migrated ParquetInputFormat version is required. Best, Jingsong On Wed, Feb 24, 2021 at 3:43 PM Etienne Chauchot wrote:

Re: [Parquet support]

2021-02-23 Thread Etienne Chauchot
Hi, Thanks guys for the comments ! I did not know it was legacy. I will give the new sources a try. Jingsong, when you say "migrate ParquetInputFormat to the new BulkFormat interface", do you mean that the new ParquetColumnarRowInputFormat is not fully functional yet? In the meantime, if yo

Re: [Parquet support]

2021-02-23 Thread Peter Huang
Hi Jingsong, Thanks for pointing this out. Actually, I planned to work on changing interfaces ParquetTableSource and ParquetInputFormat. After refactoring the code, I may also help to fix the issue in https://issues.apache.org/jira/browse/FLINK-21468. Best Regards Peter Huang On Tue, Feb 23, 202

Re: [Parquet support]

2021-02-23 Thread Jingsong Li
Hi Etienne, Thanks for your report. There are indeed many problems. There is no doubt that we need to improve our current format implementation. But ParquetTableSource and ParquetInputFormat are legacy implementations with legacy interfaces. We have introduced new interfaces for execution and

[Parquet support]

2021-02-23 Thread Etienne Chauchot
Hi all, I've been playing with Parquet with SQL and Avro lately. I've found some bugs: 1. https://issues.apache.org/jira/browse/FLINK-21388 : I already submitted a PR on this one (https://github.com/apache/flink/pull/14961) 2. https://issues.apache.org/jira/browse/FLINK-21389 3. https://is