Hi all,

Jingsong, thanks, that makes sense.

Also, sorry, but I found another bug in ParquetInputFormat:

https://issues.apache.org/jira/browse/FLINK-21520

For my urgent needs, I'll work around it by filtering the DataSet after reading rather than applying the filter in ParquetInputFormat at source read time.
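
The workaround above amounts to reading every record and applying the predicate afterwards, instead of handing the predicate to the source. A minimal plain-Java sketch of that idea follows. It deliberately does not use Flink's API: `PostReadFilter`, `Row`, `readAll` and `filterAfterRead` are hypothetical names invented for illustration, and the in-memory list only stands in for rows read from a Parquet file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PostReadFilter {
    // Hypothetical stand-in for one row read from a Parquet file.
    static final class Row {
        final String name;
        final int value;

        Row(String name, int value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return name + "=" + value;
        }
    }

    // Stand-in for the source: returns every row, with no predicate pushed down.
    static List<Row> readAll() {
        return Arrays.asList(new Row("a", 1), new Row("b", 5), new Row("c", 10));
    }

    // The workaround: apply the predicate after the rows have been read.
    static List<Row> filterAfterRead(List<Row> rows, Predicate<Row> predicate) {
        return rows.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> kept = filterAfterRead(readAll(), r -> r.value > 3);
        System.out.println(kept);
    }
}
```

The trade-off is that all rows are read before any are discarded, so this costs more I/O than a filter applied at the source, but it does not depend on the source's filter handling.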

Etienne Chauchot

On 25/02/2021 08:34, Jingsong Li wrote:
Hi Etienne,

ParquetColumnarRowInputFormat is not fully functional yet. It performs
well, but it is hard to make it support complex types such as array and
map. So I think a migrated version of ParquetInputFormat is required.

Best,
Jingsong

On Wed, Feb 24, 2021 at 3:43 PM Etienne Chauchot <echauc...@apache.org>
wrote:

Hi,

Thanks, guys, for the comments!

I did not know it was legacy. I will give the new sources a try.

Jingsong, when you say "migrate ParquetInputFormat to the new BulkFormat
interface", do you mean that the new ParquetColumnarRowInputFormat is
not fully functional yet?

In the meantime, if you agree, I think I'm still going to submit a PR for
https://issues.apache.org/jira/browse/FLINK-21393 because I need it for
an urgent task I'm working on.

Best

Etienne

On 24/02/2021 03:41, Peter Huang wrote:
Hi Jingsong,

Thanks for pointing this out. Actually, I was planning to work on changing
the interfaces of ParquetTableSource and ParquetInputFormat.
After refactoring the code, I may also be able to help fix the issue in
https://issues.apache.org/jira/browse/FLINK-21468.

Best Regards
Peter Huang

On Tue, Feb 23, 2021 at 6:35 PM Jingsong Li <jingsongl...@gmail.com>
wrote:
Hi Etienne,

Thanks for reporting these.

There are indeed many problems. There is no doubt that we need to improve
our current format implementation.

But ParquetTableSource and ParquetInputFormat are legacy implementations
with legacy interfaces. We have introduced new interfaces for execution
and SQL. You can see:
- ParquetColumnarRowInputFormat with the BulkFormat interface. It is only
  for columnar row reading and does not support complex types, so we need
  to migrate ParquetInputFormat to the new BulkFormat interface.
- FileSystemTableSource with the DynamicTableSource interface. It is a
  generic FileSystem source for all formats, so we can use it for Parquet
  too.

Considering that ParquetTableSource and ParquetInputFormat are legacy
interfaces, I think we can finish the migration work first. What do you
think?

Best,
Jingsong

On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <echauc...@apache.org>
wrote:

Hi all,

I've been playing with Parquet with SQL and Avro lately. I've found some
bugs:

1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
   submitted a PR on this one
   (https://github.com/apache/flink/pull/14961)
2. https://issues.apache.org/jira/browse/FLINK-21389
3. https://issues.apache.org/jira/browse/FLINK-21468

I've already started to work on this ticket:
https://issues.apache.org/jira/browse/FLINK-21393


I'd be happy to receive your comments on these tickets.


Best

Etienne Chauchot



--
Best, Jingsong Lee

