Thanks, Arvid!
Etienne
On 21/05/2021 12:45, Arvid Heise wrote:
Hi Etienne,
I'm taking over and just left you a review. Sorry for the long delays.
Best,
Arvid
On Fri, May 21, 2021 at 11:25 AM Etienne Chauchot <echauc...@apache.org>
wrote:
Hi all,
Considering (see my email below) that the DataStream API is not yet fully
functional in batch mode, and that the new sources are only available
on the DataStream API, can we merge this PR [1] about existing sources
in the DataSet API? It has already received one LGTM.
[1] https://github.com/apache/flink/pull/15156
Anyone?
Etienne
On 06/05/2021 14:23, Etienne Chauchot wrote:
Hi,
@Jingsong, I agree that adding a new feature (ParquetAvroInputFormat)
to an old source API is a maintenance burden. But IMHO, while the new
convergent batch/streaming DataStream API is not 100% functional, we
still need to maintain the older sources and add missing features to
them.
Indeed, I realized that the DataStream API in batch mode [1] does not
support aggregations yet [2], so in such a case a user would stick to
the DataSet API. And the new FileSource API with
ParquetColumnarRowInputFormat is not available in the DataSet API [3].
So, long story short, in some cases a user will have no choice other
than to use ParquetInputFormat and the legacy source.
WDYT?
[1] https://issues.apache.org/jira/browse/FLINK-19316
[2] https://issues.apache.org/jira/browse/FLINK-22587
[3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-Compatibility,Deprecation,andMigrationPlan
Best,
Etienne
On 24/02/2021 03:35, Jingsong Li wrote:
Hi Etienne,
Thanks for reporting these.
There are indeed many problems. There is no doubt that we need to improve
our current format implementation.
But ParquetTableSource and ParquetInputFormat are legacy implementations
with legacy interfaces. We have introduced new interfaces for execution and
SQL. You can see:
- ParquetColumnarRowInputFormat with the BulkFormat interface. It is just for
columnar row reading and does not support complex types; we need to
migrate ParquetInputFormat to the new BulkFormat interface.
- FileSystemTableSource with the DynamicTableSource interface. It is a generic
FileSystem source for all formats, so we can use it for Parquet too.
Considering that ParquetTableSource and ParquetInputFormat are legacy
interfaces, I think we can finish the migration work first. What do you
think?
Best,
Jingsong
On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <echauc...@apache.org>
wrote:
Hi all,
I've been playing with Parquet with SQL and Avro lately. I've found some
bugs:
1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
submitted a PR on this one (https://github.com/apache/flink/pull/14961)
2. https://issues.apache.org/jira/browse/FLINK-21389
3. https://issues.apache.org/jira/browse/FLINK-21468
I've already started to work on this ticket:
https://issues.apache.org/jira/browse/FLINK-21393
I'd be happy to receive your comments on these tickets.
Best,
Etienne Chauchot