Hi Etienne,

I'm taking over and just left you a review. Sorry for the long delays.

Best,
Arvid

On Fri, May 21, 2021 at 11:25 AM Etienne Chauchot <echauc...@apache.org> wrote:

> Hi all,
>
> considering (see my email below) that the DataStream API is not fully
> functional yet (batch mode) and considering that the new sources are
> only available on DataStream API, can we merge this PR (1) about
> existing sources in DataSet API? It has received already one LGTM.
>
> [1] https://github.com/apache/flink/pull/15156
>
> anyone ?
>
> Etienne
>
> On 06/05/2021 14:23, Etienne Chauchot wrote:
> > Hi,
> >
> > @Jingsong, I agree that adding a new feature (ParquetAvroInputFormat)
> > to an old source API is a maintenance burden. But IMHO I think that
> > while the new DataStream batch/streaming convergent API is not 100%
> > functional we still need to maintain older sources and add missing
> > features to them.
> >
> > Indeed, I realized that DataStream API in batch mode (1) does not
> > support aggregations yet (2) so in such a case a user would stick to
> > the DataSet API. And the new FileSource API with
> > ParquetColumnarRowInputFormat is not available in DataSet API (3).
> >
> > So, long story short, in some cases a user will have no other choice
> > than using ParquetInputFormat and legacy source.
> >
> > WDYT ?
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-19316
> >
> > [2] https://issues.apache.org/jira/browse/FLINK-22587
> >
> > [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-Compatibility,Deprecation,andMigrationPlan
> >
> > Best,
> >
> > Etienne
> >
> > On 24/02/2021 03:35, Jingsong Li wrote:
> >> Hi Etienne,
> >>
> >> Thanks for your reporting.
> >>
> >> There are indeed many problems. There is no doubt that we need to improve
> >> our current format implementation.
> >>
> >> But ParquetTableSource and ParquetInputFormat are legacy implementations
> >> with legacy interfaces. We have introduced new interfaces for execution and
> >> SQL. You can see:
> >> - ParquetColumnarRowInputFormat with BulkFormat interface. It is just for
> >> columnar row reading, not support complex types, we need
> >> migrate ParquetInputFormat to the new BulkFormat interface.
> >> - FileSystemTableSource with DynamicTableSource interface, It is a generic
> >> FileSystem source for all formats, we can just use it for parquet too.
> >>
> >> Considering ParquetTableSource and ParquetInputFormat are legacy
> >> interfaces, I think we can finish migration work first, what do you
> >> think?
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <echauc...@apache.org>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I've been playing with Parquet with SQL and Avro lately. I've found some
> >>> bugs:
> >>>
> >>> 1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
> >>> submitted a PR on this one (https://github.com/apache/flink/pull/14961)
> >>>
> >>> 2. https://issues.apache.org/jira/browse/FLINK-21389
> >>>
> >>> 3. https://issues.apache.org/jira/browse/FLINK-21468
> >>>
> >>> I've already started to work on this ticket:
> >>> https://issues.apache.org/jira/browse/FLINK-21393
> >>>
> >>> I'd be happy to receive your comments on these tickets
> >>>
> >>> Best
> >>>
> >>> Etienne Chauchot