Re: FlinkFileIO implementation

Ryan Blue Mon, 22 Apr 2024 15:00:25 -0700

I don't think introducing a Flink-specific FileIO is a good idea.
The intent of the Java API is for a table to use the FileIO instance that
is supplied by the table object. That puts the responsibility for supplying
a correctly configured FileIO on the catalog, which is the right place to
inject most customization.
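
To make that model concrete, here is a minimal sketch. Everything in it
(SketchFileIO, SketchTable, SketchCatalog, InMemoryFileIO, writeAndReadBack)
is a hypothetical, simplified stand-in and not the real Iceberg interfaces;
it only illustrates who creates the FileIO and who merely uses it.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; these are NOT the real Iceberg types.
class FileIOModelSketch {

  // Stand-in for a FileIO: reads and writes whole files by location.
  interface SketchFileIO {
    void write(String location, String contents);
    String read(String location);
  }

  // Stand-in for a table: it exposes the FileIO it was loaded with.
  interface SketchTable {
    String location();
    SketchFileIO io();
  }

  // In-memory FileIO so the sketch runs without any storage system.
  static class InMemoryFileIO implements SketchFileIO {
    private final Map<String, String> files = new HashMap<>();

    @Override
    public void write(String location, String contents) {
      files.put(location, contents);
    }

    @Override
    public String read(String location) {
      String contents = files.get(location);
      if (contents == null) {
        throw new IllegalArgumentException("No such file: " + location);
      }
      return contents;
    }
  }

  // The catalog is the single place where the FileIO is configured.
  static class SketchCatalog {
    private final String warehouse;
    private final SketchFileIO io;

    SketchCatalog(String warehouse, SketchFileIO io) {
      this.warehouse = warehouse;
      this.io = io;
    }

    // Every table loaded from this catalog carries the catalog's FileIO.
    SketchTable loadTable(String name) {
      final String tableLocation = warehouse + "/" + name;
      final SketchFileIO tableIo = io;
      return new SketchTable() {
        @Override
        public String location() {
          return tableLocation;
        }

        @Override
        public SketchFileIO io() {
          return tableIo;
        }
      };
    }
  }

  // Engine-side code: it never builds a FileIO, it only asks the table for one.
  static String writeAndReadBack(SketchTable table, String fileName, String contents) {
    String path = table.location() + "/" + fileName;
    table.io().write(path, contents);
    return table.io().read(path);
  }

  public static void main(String[] args) {
    SketchCatalog catalog = new SketchCatalog("mem://warehouse", new InMemoryFileIO());
    SketchTable table = catalog.loadTable("db.events");
    System.out.println(writeAndReadBack(table, "data-0.txt", "hello"));
  }
}
```

The point of the sketch is that the engine-side method only ever calls
table.io(); swapping storage or credentials means changing what the catalog
hands out, not adding an engine-specific FileIO.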


Having a Flink FileIO doesn't fit with that model. You could expose a
generic FileIO from the catalog if you wanted, which would make a lot more
sense. But FileIO is not the same thing as a FileSystem implementation. It
should be used in places where it makes sense to use FileIO and should not
be given the same lifecycle or responsibilities as a FileSystem.

Ryan

On Mon, Apr 22, 2024 at 11:00 AM Ferenc Csaky <ferenc.cs...@pm.me.invalid>
wrote:

> Hi Peter,
>
> I am coming from the Flink side, but at Cloudera we also use
> Iceberg.
>
> Utilizing the Flink delegation token framework via the Iceberg Java API
> would be great. I think that simplifying the configuration for
> Flink-related cases also has value on its own, and could help to
> eliminate some confusion regarding when/where to set properties.
>
> Regarding the naming, maybe it would be worth being more explicit and
> calling the class `FlinkFSFileIO`? Just to emphasize that "Flink" in
> this context refers to the FS abstraction layer, not the
> processing engine. WDYT?
>
> Looking forward to this change!
>
> Best,
> Ferenc
>
> On 2024/04/19 17:08:23 Péter Váry wrote:
> > Hi Iceberg Team,
> >
> > Flink has its own FileSystem implementation. See:
> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
> > This FileSystem already has several implementations:
> >
> >    - Hadoop
> >    - Azure
> >    - S3
> >    - Google Cloud Storage
> >    - ...
> >
> > As a general rule in Flink, one should use this FileSystem to consume and
> > persistently store data.
> > If these FileSystems are configured, then Flink makes sure that the
> > configurations are consistent and available for the JM/TM.
> > Also, as an added benefit, delegation tokens are handled and distributed
> > for these FileSystems automatically. See:
> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
> >
> > In house, some of our new users are struggling with parametrizing
> > HadoopFileIO and S3FileIO for Iceberg, and with understanding
> > that they have to provide different configurations for the checkpointing
> > and for the Iceberg table storage (even if they are stored in the same
> > bucket, or on the same HDFS cluster).
> >
> > I have created a PR, which provides a FileIO implementation which uses
> > FlinkFileSystem. Very imaginatively I have named it FlinkFileIO. See:
> > https://github.com/apache/iceberg/pull/10151
> >
> > This would allow the users to configure the FileSystem only once, and use
> > this FileSystem to access Iceberg tables. Also, if for whatever reason
> > the global nature of the Flink FileSystem config is limiting, the users
> > could still revert to using the other FileIO implementations.
> >
> > What do you think? Would this be a useful addition to the Iceberg-Flink
> > integration?
> >
> > Thanks,
> > Peter
> >
>


-- 
Ryan Blue
Tabular
