RE: FlinkFileIO implementation

Ferenc Csaky Mon, 22 Apr 2024 11:00:48 -0700

Hi Peter,

I am coming from the Flink side, but at Cloudera we use
Iceberg as well.


Utilizing the Flink delegation token framework via the Iceberg Java
API would be great. I think that simplifying the configuration for
Flink-related cases also has value on its own, and could help to
eliminate some confusion about when and where to set properties.

Regarding the naming, maybe it would be worth being more explicit and
calling the class `FlinkFSFileIO`? Just to emphasize that "Flink" in
this context refers to the FS abstraction layer, not the
processing engine. WDYT?

Looking forward to this change!

Best,
Ferenc

On 2024/04/19 17:08:23 Péter Váry wrote:
> Hi Iceberg Team,
>
> Flink has its own FileSystem implementation. See:
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
> .
> This FileSystem already has several implementations:
>
>    - Hadoop
>    - Azure
>    - S3
>    - Google Cloud Storage
>    - ...
>
> As a general rule in Flink, one should use this FileSystem to consume and
> persistently store data.
> If these FileSystems are configured, then Flink makes sure that the
> configurations are consistent and available for the JM/TM.
> Also as an added benefit, delegation tokens are handled and distributed for
> these FileSystems automatically. See:
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
>
> In house, some of our new users are struggling with parametrizing
> HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around
> the fact that they have to provide different configurations for the
> checkpointing and for the Iceberg table storage (even if they are stored
> in the same bucket, or on the same HDFS cluster).
>
> I have created a PR, which provides a FileIO implementation which uses
> the Flink FileSystem. Very imaginatively I have named it FlinkFileIO. See:
> https://github.com/apache/iceberg/pull/10151
>
> This would allow the users to configure the FileSystem only once, and use
> this FileSystem to access Iceberg tables. Also, if for whatever reason the
> global nature of the Flink FileSystem config is limiting, the users could
> still fall back to using the other FileIO implementations.
>
> What do you think? Would this be a useful addition to the Iceberg-Flink
> integration?
>
> Thanks,
> Peter
>
