I don't think introducing a Flink-specific FileIO is a good idea. The intent of the Java API is for a table to use the FileIO instance that is supplied by the table object. That puts the responsibility for supplying a correctly configured FileIO on the catalog, which is the right place to inject most customization.
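To make that model concrete, here is a minimal sketch. The types below are hypothetical, heavily simplified stand-ins rather than the real Iceberg interfaces (the real `Table` exposes `io()`, and the real `FileIO` works with `newInputFile(String)` / `newOutputFile(String)` instead of plain strings). The only point it illustrates is that the catalog configures the FileIO once and engine-side code uses whatever the table hands back.

```java
import java.util.HashMap;
import java.util.Map;

public class FileIOFromCatalogSketch {

    // Hypothetical stand-in for Iceberg's FileIO; the real interface is richer.
    public interface FileIO {
        String read(String location);
        void write(String location, String content);
    }

    // Hypothetical stand-in for a table: it exposes the FileIO it was loaded with.
    public interface Table {
        String name();
        FileIO io();
    }

    // Simple table that just holds its name and the FileIO supplied by the catalog.
    public static final class SimpleTable implements Table {
        private final String name;
        private final FileIO io;

        SimpleTable(String name, FileIO io) {
            this.name = name;
            this.io = io;
        }

        @Override public String name() { return name; }
        @Override public FileIO io() { return io; }
    }

    // Hypothetical catalog: the single place where the FileIO is chosen and configured.
    public static final class SimpleCatalog {
        private final FileIO io;

        public SimpleCatalog(FileIO io) {
            this.io = io;
        }

        public Table loadTable(String name) {
            // Every table loaded from this catalog shares the catalog's FileIO.
            return new SimpleTable(name, io);
        }
    }

    // Toy FileIO backed by a map, used only so the sketch can run without storage.
    public static final class InMemoryFileIO implements FileIO {
        private final Map<String, String> files = new HashMap<>();

        @Override public String read(String location) {
            String content = files.get(location);
            if (content == null) {
                throw new IllegalArgumentException("No such file: " + location);
            }
            return content;
        }

        @Override public void write(String location, String content) {
            files.put(location, content);
        }
    }

    // Engine-side code: it never constructs a FileIO, it asks the table for one.
    public static String readThroughTable(Table table, String location) {
        return table.io().read(location);
    }

    public static void main(String[] args) {
        InMemoryFileIO io = new InMemoryFileIO();
        SimpleCatalog catalog = new SimpleCatalog(io);
        Table table = catalog.loadTable("db.events");

        table.io().write("mem://db/events/data-0001.txt", "hello");
        System.out.println(readThroughTable(table, "mem://db/events/data-0001.txt"));
    }
}
```

In this arrangement, swapping the storage layer means changing how the catalog is configured, not adding an engine-specific FileIO.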
Having a Flink FileIO doesn't fit with that model. You could expose a generic FileIO from the catalog if you wanted, which would make a lot more sense. But FileIO is not the same thing as a FileSystem implementation. It should be used in places where it makes sense to use FileIO and should not be given the same lifecycle or responsibilities as a FileSystem.

Ryan

On Mon, Apr 22, 2024 at 11:00 AM Ferenc Csaky <ferenc.cs...@pm.me.invalid> wrote:

> Hi Peter,
>
> I am coming from the Flink side, but at Cloudera we use
> Iceberg as well.
>
> Utilizing the Flink delegation token framework via the Iceberg Java API
> would be great. I think that simplifying the configuration for
> Flink-related cases also has value on its own, and could help to
> eliminate some confusion regarding when/where to set properties.
>
> Regarding the naming, maybe it would be worth being more explicit and
> calling the class `FlinkFSFileIO`? Just to emphasize that "Flink" in
> this context refers to the FS abstraction layer, not to the
> processing engine. WDYT?
>
> Looking forward to this change!
>
> Best,
> Ferenc
>
> On 2024/04/19 17:08:23 Péter Váry wrote:
> > Hi Iceberg Team,
> >
> > Flink has its own FileSystem implementation. See:
> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
> > This FileSystem already has several implementations:
> >
> > - Hadoop
> > - Azure
> > - S3
> > - Google Cloud Storage
> > - ...
> >
> > As a general rule in Flink, one should use this FileSystem to consume and
> > persistently store data.
> > If these FileSystems are configured, then Flink makes sure that the
> > configurations are consistent and available for the JM/TM.
> > As an added benefit, delegation tokens are handled and distributed for
> > these FileSystems automatically. See:
> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
> >
> > In house, some of our new users are struggling with parametrizing
> > HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around
> > the fact that they have to provide different configurations for the
> > checkpointing and for the Iceberg table storage (even if they are stored
> > in the same bucket, or on the same HDFS cluster).
> >
> > I have created a PR which provides a FileIO implementation that uses
> > FlinkFileSystem. Very imaginatively, I have named it FlinkFileIO. See:
> > https://github.com/apache/iceberg/pull/10151
> >
> > This would allow users to configure the FileSystem only once, and use
> > this FileSystem to access Iceberg tables. Also, if for whatever reason the
> > global nature of the Flink FileSystem config is limiting, users could
> > still revert to using the other FileIO implementations.
> >
> > What do you think? Would this be a useful addition to the Iceberg-Flink
> > integration?
> >
> > Thanks,
> > Peter

--
Ryan Blue
Tabular