FlinkFileIO implementation

Péter Váry Fri, 19 Apr 2024 10:08:44 -0700

Hi Iceberg Team,

Flink has its own FileSystem implementation. See:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
This FileSystem already has several implementations:


   - Hadoop
   - Azure
   - S3
   - Google Cloud Storage
   - ...

As a general rule in Flink, one should use this FileSystem to consume and
persistently store data.
If these FileSystems are configured, then Flink makes sure that the
configurations are consistent and available for the JobManager and
TaskManagers (JM/TM).
Also as an added benefit, delegation tokens are handled and distributed for
these FileSystems automatically. See:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/

In house, some of our new users are struggling with configuring
HadoopFileIO and S3FileIO for Iceberg. They find it hard to understand
why they have to provide separate configurations for checkpointing
and for Iceberg table storage (even if both are stored in the same
bucket, or on the same HDFS cluster).
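
To illustrate the duplication, here is a minimal sketch of the two sets of
settings for S3, assuming a hypothetical bucket named my-bucket and
placeholder credentials. The first group goes into the Flink configuration
and is used by Flink's FileSystem for checkpoints. The second group is
passed as Iceberg catalog properties and is used by S3FileIO for table
data. Exact keys may differ depending on the versions and plugins in use.

```yaml
# Flink configuration: read by Flink's S3 FileSystem (checkpoints)
state.checkpoints.dir: s3://my-bucket/checkpoints   # hypothetical bucket
s3.access-key: <access-key>                         # placeholder
s3.secret-key: <secret-key>                         # placeholder

# Iceberg catalog properties: read by S3FileIO (table data and metadata)
io-impl: org.apache.iceberg.aws.s3.S3FileIO
s3.access-key-id: <access-key>                      # same credentials, different key
s3.secret-access-key: <secret-key>                  # same credentials, different key
```

Both groups point at the same bucket with the same credentials, yet they
have to be maintained separately and under different key names.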

I have created a PR that provides a FileIO implementation backed by
Flink's FileSystem. Very imaginatively I have named it FlinkFileIO. See:
https://github.com/apache/iceberg/pull/10151

This would allow users to configure the FileSystem only once, and use
this FileSystem to access Iceberg tables. Also, if for whatever reason the
global nature of the Flink FileSystem configuration is limiting, users
could still fall back to the other FileIO implementations.

What do you think? Would this be a useful addition to the Iceberg-Flink
integration?

Thanks,
Peter
