Re: [Parquet support]

Arvid Heise Fri, 21 May 2021 03:46:21 -0700

Hi Etienne,

I'm taking over and just left you a review. Sorry for the long delays.


Best,

Arvid

On Fri, May 21, 2021 at 11:25 AM Etienne Chauchot <echauc...@apache.org>
wrote:

> Hi all,
>
> Considering (see my email below) that the DataStream API is not fully
> functional yet in batch mode, and that the new sources are only
> available in the DataStream API, can we merge this PR [1] about the
> existing sources in the DataSet API? It has already received one LGTM.
>
> [1] https://github.com/apache/flink/pull/15156
>
> Anyone?
>
> Etienne
>
> On 06/05/2021 14:23, Etienne Chauchot wrote:
> > Hi,
> >
> > @Jingsong, I agree that adding a new feature (ParquetAvroInputFormat)
> > to an old source API is a maintenance burden. But IMHO, while the new
> > DataStream batch/streaming convergent API is not 100% functional, we
> > still need to maintain the older sources and add missing features to
> > them.
> >
> > Indeed, I realized that the DataStream API in batch mode [1] does not
> > support aggregations yet [2], so in such a case a user would stick to
> > the DataSet API. And the new FileSource API with
> > ParquetColumnarRowInputFormat is not available in the DataSet API [3].
> >
> > So, long story short, in some cases a user will have no other choice
> > than to use ParquetInputFormat and the legacy source.
> >
> > WDYT?
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-19316
> >
> > [2] https://issues.apache.org/jira/browse/FLINK-22587
> >
> > [3]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-Compatibility,Deprecation,andMigrationPlan
> >
> > Best,
> >
> > Etienne
> >
> > On 24/02/2021 03:35, Jingsong Li wrote:
> >> Hi Etienne,
> >>
> >> Thanks for reporting.
> >>
> >> There are indeed many problems. There is no doubt that we need to
> >> improve our current format implementation.
> >>
> >> But ParquetTableSource and ParquetInputFormat are legacy implementations
> >> with legacy interfaces. We have introduced new interfaces for execution
> >> and SQL. You can see:
> >> - ParquetColumnarRowInputFormat with the BulkFormat interface. It is
> >> only for columnar row reading and does not support complex types; we
> >> need to migrate ParquetInputFormat to the new BulkFormat interface.
> >> - FileSystemTableSource with the DynamicTableSource interface. It is a
> >> generic FileSystem source for all formats; we can use it for Parquet
> >> too.
> >>
> >> Considering ParquetTableSource and ParquetInputFormat are legacy
> >> interfaces, I think we can finish the migration work first. What do
> >> you think?
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <echauc...@apache.org>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I've been playing with Parquet with SQL and Avro lately. I've found
> >>> some bugs:
> >>>
> >>> 1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
> >>> submitted a PR on this one (https://github.com/apache/flink/pull/14961)
> >>>
> >>> 2. https://issues.apache.org/jira/browse/FLINK-21389
> >>>
> >>> 3. https://issues.apache.org/jira/browse/FLINK-21468
> >>>
> >>> I've already started to work on this ticket:
> >>> https://issues.apache.org/jira/browse/FLINK-21393
> >>>
> >>>
> >>> I'd be happy to receive your comments on these tickets.
> >>>
> >>>
> >>> Best
> >>>
> >>> Etienne Chauchot
> >>>
> >>>
> >>>
>
