ResolvingFileIO is a somewhat different issue, and it gets more complicated
with a concept like FlinkFileIO because the schemes would overlap.
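
To make the overlap concrete, here is a rough sketch of scheme-based
resolution. It is a simplified stand-in, not Iceberg's actual code: the
class SchemeOverlapSketch and its methods are hypothetical, and the
implementations are represented by plain name strings.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemeOverlapSketch {
  // Hypothetical scheme -> implementation-name table (illustration only).
  private final Map<String, String> schemeToImpl = new LinkedHashMap<>();
  private final String fallbackImpl;

  public SchemeOverlapSketch(String fallbackImpl) {
    this.fallbackImpl = fallbackImpl;
  }

  /** Registers an implementation for a scheme; returns the one it displaced, or null. */
  public String register(String scheme, String impl) {
    return schemeToImpl.put(scheme, impl);
  }

  /** Picks an implementation purely from the URI scheme of the location. */
  public String resolve(String location) {
    String scheme = URI.create(location).getScheme();
    return scheme == null ? fallbackImpl : schemeToImpl.getOrDefault(scheme, fallbackImpl);
  }

  public static void main(String[] args) {
    SchemeOverlapSketch resolver = new SchemeOverlapSketch("HadoopFileIO");
    resolver.register("s3", "S3FileIO");
    resolver.register("gs", "GCSFileIO");

    // A Flink-backed FileIO would claim "s3" as well, so one of the two has to lose.
    String displaced = resolver.register("s3", "FlinkFileIO");
    System.out.println("displaced for s3: " + displaced);
    System.out.println("s3://bucket/table -> " + resolver.resolve("s3://bucket/table"));
  }
}
```

With a single table keyed by scheme, whichever implementation is registered
last for "s3" wins, so a resolver cannot serve both S3FileIO and a
Flink-backed FileIO for the same locations without some extra rule.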

The main issue here is around how Flink handles file system operations
outside of the Iceberg space (e.g. checkpointing) and the confusion it
causes for people setting up Flink.

I'm concerned that the FlinkFileIO approach will ultimately just push that
problem to the client side, since much of the FileIO configuration for a
table will come from the catalog (like Ryan pointed out).

We need to discuss this a little more and see if there's a way to preserve
catalog/table managed configuration along with simplifying the config for


On Thu, Apr 25, 2024 at 9:48 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Peter,
> On a similar topic, I created a PR to support custom schemes in
> ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe
> FlinkFileIO could be a new scheme/extension in ResolvingFileIO.
> While I agree that it would be interesting to have support for
> FlinkFileIO, I'm not sure it's a good idea to have it directly in
> Iceberg. I think it would be great to leverage the extension mechanism
> we have in Iceberg (FileIO/ResolvingFileIO).
> Iceberg Core should not include engine-specific dependencies, imho.
> However, having a "flink:" scheme in ResolvingFileIO where we can
> leverage FlinkFileIO could be interesting.
> Just thinking out loud :)
> Regards
> JB
> On Fri, Apr 19, 2024 at 12:08 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
> >
> > Hi Iceberg Team,
> >
> > Flink has its own FileSystem implementation. See:
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
> .
> > This FileSystem already has several implementations:
> >
> > Hadoop
> > Azure
> > S3
> > Google Cloud Storage
> > ...
> >
> > As a general rule in Flink, one should use this FileSystem to consume
> and persistently store data.
> > If these FileSystems are configured, then Flink makes sure that the
> configurations are consistent and available for the JM/TM.
> > Also as an added benefit, delegation tokens are handled and distributed
> for these FileSystems automatically. See:
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
> >
> > In house, some of our new users are struggling with parametrizing
> HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around
> the fact that they have to provide different configurations for
> checkpointing and for the Iceberg table storage (even if both are stored
> in the same bucket, or on the same HDFS cluster).
> >
> > I have created a PR, which provides a FileIO implementation which uses
> FlinkFileSystem. Very imaginatively I have named it FlinkFileIO. See:
> https://github.com/apache/iceberg/pull/10151
> >
> > This would allow the users to configure the FileSystem only once, and
> use this FileSystem to access Iceberg tables. Also, if for whatever reason
> the global nature of the Flink file system config is limiting, the users
> could still revert to using the other FileIO implementations.
> >
> > What do you think? Would this be a useful addition to the Iceberg-Flink
> integration?
> >
> > Thanks,
> > Peter
