Good point about the schemes, that's true, it would be more complicated. Agreed that it deserves more discussion. Personally, I don't think it's a bad idea to have the catalog as the "source" for FileIO and let the engine/client deal with the rest; I see that as an engine/client responsibility (I remember a somewhat similar discussion in Apache Beam about the runners).
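To make concrete what I mean by the catalog being the "source" for FileIO, here is a simplified, purely illustrative sketch: the catalog/table properties pick the FileIO implementation and initialize it, and the engine/client just consumes whatever comes back. CatalogUtil.loadFileIO and the io-impl property are the usual entry points; the class and method names below are made up for the example, and the exact set of properties a given catalog forwards varies.

```java
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.io.FileIO;

// Illustrative only: the catalog decides which FileIO the engine ends up with and
// which settings initialize it; the engine just consumes the result.
public class CatalogProvidedFileIO {

  public static FileIO load(Map<String, String> catalogProps) {
    // "io-impl" (CatalogProperties.FILE_IO_IMPL) is set by/for the catalog, not the engine.
    // ResolvingFileIO is only a fallback for this sketch; real catalogs choose their own default.
    String impl = catalogProps.getOrDefault(
        CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.io.ResolvingFileIO");

    // The same property map is handed to FileIO.initialize(), so catalog-managed
    // settings (credentials, endpoints, ...) travel with the table.
    return CatalogUtil.loadFileIO(impl, catalogProps, null /* Hadoop conf, if needed */);
  }
}
```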
Agree to discuss more :)

Regards
JB

On Thu, Apr 25, 2024 at 12:41 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>
> JB,
>
> The ResolvingFileIO is somewhat a different issue, and more complicated with
> a concept like FlinkFileIO, because the schemes would overlap.
>
> The main issue here is how Flink handles file system operations outside of
> the Iceberg space (e.g. checkpointing) and the confusion it causes for
> people setting up Flink.
>
> I'm concerned that the FlinkFileIO approach will ultimately just push that
> problem to the client side, since much of the FileIO configuration for a
> table comes from the catalog (as Ryan pointed out).
>
> We need to discuss this a little more and see if there's a way to preserve
> catalog/table-managed configuration while also simplifying the config for
> users.
>
> -Dan
>
> On Thu, Apr 25, 2024 at 9:48 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Peter,
>>
>> On a similar topic, I created a PR to support custom schemes in
>> ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe
>> FlinkFileIO could be a new scheme/extension in ResolvingFileIO.
>>
>> While I agree that it would be interesting to have support for
>> FlinkFileIO, I'm not sure it's a good idea to have it directly in Iceberg.
>> I think it would be better to leverage the extension mechanism we already
>> have in Iceberg (FileIO/ResolvingFileIO); Iceberg core should not include
>> engine-specific dependencies, imho.
>> However, having a "flink:" scheme in ResolvingFileIO that delegates to
>> FlinkFileIO could be interesting.
>>
>> Just thinking out loud :)
>>
>> Regards
>> JB
>>
>> On Fri, Apr 19, 2024 at 12:08 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >
>> > Hi Iceberg Team,
>> >
>> > Flink has its own FileSystem abstraction. See:
>> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
>> > This FileSystem already has several implementations:
>> >
>> > - Hadoop
>> > - Azure
>> > - S3
>> > - Google Cloud Storage
>> > - ...
>> >
>> > As a general rule in Flink, one should use this FileSystem to consume
>> > and persistently store data. If these FileSystems are configured, Flink
>> > makes sure the configurations are consistent and available on the
>> > JobManager / TaskManagers. As an added benefit, delegation tokens are
>> > handled and distributed for these FileSystems automatically. See:
>> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
>> >
>> > In house, some of our new users are struggling with parametrizing
>> > HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around
>> > the fact that they have to provide different configurations for
>> > checkpointing and for the Iceberg table storage (even if both are stored
>> > in the same bucket, or on the same HDFS cluster).
>> >
>> > I have created a PR which provides a FileIO implementation that uses the
>> > Flink FileSystem. Very imaginatively, I have named it FlinkFileIO. See:
>> > https://github.com/apache/iceberg/pull/10151
>> >
>> > This would allow users to configure the FileSystem only once and use it
>> > to access Iceberg tables as well. And if the global nature of the Flink
>> > file system configuration ever becomes limiting, users can still fall
>> > back to the other FileIO implementations.
>> >
>> > What do you think? Would this be a useful addition to the Iceberg-Flink
>> > integration?
>> >
>> > Thanks,
>> > Peter
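PS (for anyone skimming the thread): below is a rough, untested sketch of the kind of FileIO Peter describes, just to make the shape of the idea concrete. It delegates every operation to whatever Flink FileSystem is registered for the path's scheme, so the Flink-level file system configuration (and delegation tokens) is reused. The class name and structure are illustrative only; the actual implementation in PR #10151 is more complete and differs in details.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.FileSystem.WriteMode;
import org.apache.flink.core.fs.Path;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.PositionOutputStream;
import org.apache.iceberg.io.SeekableInputStream;

// Illustrative sketch only (not the code from PR #10151): a FileIO that delegates to
// Flink's FileSystem, so the file systems configured for Flink (checkpointing etc.)
// are reused for Iceberg data and metadata files.
public class FlinkFileSystemFileIO implements FileIO {

  // Look up the Flink FileSystem registered for the location's scheme (s3://, hdfs://, ...)
  private static FileSystem fs(String location) {
    try {
      return FileSystem.get(new Path(location).toUri());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public InputFile newInputFile(String location) {
    return new InputFile() {
      @Override public String location() { return location; }

      @Override public long getLength() {
        try { return fs(location).getFileStatus(new Path(location)).getLen(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
      }

      @Override public boolean exists() {
        try { return fs(location).exists(new Path(location)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
      }

      @Override public SeekableInputStream newStream() {
        try {
          FSDataInputStream in = fs(location).open(new Path(location));
          // Adapt Flink's seekable stream to Iceberg's SeekableInputStream contract
          return new SeekableInputStream() {
            @Override public int read() throws IOException { return in.read(); }
            @Override public long getPos() throws IOException { return in.getPos(); }
            @Override public void seek(long newPos) throws IOException { in.seek(newPos); }
            @Override public void close() throws IOException { in.close(); }
          };
        } catch (IOException e) { throw new UncheckedIOException(e); }
      }
    };
  }

  @Override
  public OutputFile newOutputFile(String location) {
    return new OutputFile() {
      @Override public String location() { return location; }
      @Override public PositionOutputStream create() { return stream(WriteMode.NO_OVERWRITE); }
      @Override public PositionOutputStream createOrOverwrite() { return stream(WriteMode.OVERWRITE); }
      @Override public InputFile toInputFile() { return FlinkFileSystemFileIO.this.newInputFile(location); }

      private PositionOutputStream stream(WriteMode mode) {
        try {
          FSDataOutputStream out = fs(location).create(new Path(location), mode);
          // Adapt Flink's output stream to Iceberg's PositionOutputStream contract
          return new PositionOutputStream() {
            @Override public void write(int b) throws IOException { out.write(b); }
            @Override public long getPos() throws IOException { return out.getPos(); }
            @Override public void close() throws IOException { out.close(); }
          };
        } catch (IOException e) { throw new UncheckedIOException(e); }
      }
    };
  }

  @Override
  public void deleteFile(String location) {
    try {
      fs(location).delete(new Path(location), false /* not recursive */);
    } catch (IOException e) { throw new UncheckedIOException(e); }
  }
}
```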