Good point about the schemes, that's true, it would be more complicated. Agreed that it deserves more discussion. Personally, I don't think it's a bad idea to have the catalog as the "source" for FileIO and let the engine/client deal with the rest; I see that as an engine/client responsibility (I remember a somewhat similar discussion in Apache Beam about the runners).
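To make concrete what I mean by the catalog being the "source" for FileIO, here is a simplified, purely illustrative sketch: the catalog/table properties pick the FileIO implementation and initialize it, and the engine/client just consumes whatever comes back. CatalogUtil.loadFileIO and the io-impl property are the usual entry points; the class and method names below are made up for the example, and the exact set of properties a given catalog forwards varies.

```java
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.io.FileIO;

// Illustrative only: the catalog decides which FileIO the engine ends up with and
// which settings initialize it; the engine just consumes the result.
public class CatalogProvidedFileIO {

  public static FileIO load(Map<String, String> catalogProps) {
    // "io-impl" (CatalogProperties.FILE_IO_IMPL) is set by/for the catalog, not the engine.
    // ResolvingFileIO is only a fallback for this sketch; real catalogs choose their own default.
    String impl = catalogProps.getOrDefault(
        CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.io.ResolvingFileIO");

    // The same property map is handed to FileIO.initialize(), so catalog-managed
    // settings (credentials, endpoints, ...) travel with the table.
    return CatalogUtil.loadFileIO(impl, catalogProps, null /* Hadoop conf, if needed */);
  }
}
```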
Agree to discuss more :)

Regards
JB

On Thu, Apr 25, 2024 at 12:41 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>
> JB,
>
> The ResolvingFileIO is somewhat a different issue, and more complicated with
> a concept like FlinkFileIO, because the schemes would overlap.
>
> The main issue here is how Flink handles file system operations outside of
> the Iceberg space (e.g. checkpointing) and the confusion it causes for
> people setting up Flink.
>
> I'm concerned that the FlinkFileIO approach will ultimately just push that
> problem to the client side, since much of the FileIO configuration for a
> table comes from the catalog (as Ryan pointed out).
>
> We need to discuss this a little more and see if there's a way to preserve
> catalog/table-managed configuration while also simplifying the config for
> users.
>
> -Dan
>
> On Thu, Apr 25, 2024 at 9:48 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Peter,
>>
>> On a similar topic, I created a PR to support custom schemes in
>> ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe
>> FlinkFileIO could be a new scheme/extension in ResolvingFileIO.
>>
>> While I agree that it would be interesting to have support for
>> FlinkFileIO, I'm not sure it's a good idea to have it directly in Iceberg.
>> I think it would be better to leverage the extension mechanism we already
>> have in Iceberg (FileIO/ResolvingFileIO); Iceberg core should not include
>> engine-specific dependencies, imho.
>> However, having a "flink:" scheme in ResolvingFileIO that delegates to
>> FlinkFileIO could be interesting.
>>
>> Just thinking out loud :)
>>
>> Regards
>> JB
>>
>> On Fri, Apr 19, 2024 at 12:08 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >
>> > Hi Iceberg Team,
>> >
>> > Flink has its own FileSystem abstraction. See:
>> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
>> > This FileSystem already has several implementations:
>> >
>> > - Hadoop
>> > - Azure
>> > - S3
>> > - Google Cloud Storage
>> > - ...
>> >
>> > As a general rule in Flink, one should use this FileSystem to consume
>> > and persistently store data. If these FileSystems are configured, Flink
>> > makes sure the configurations are consistent and available on the
>> > JobManager / TaskManagers. As an added benefit, delegation tokens are
>> > handled and distributed for these FileSystems automatically. See:
>> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
>> >
>> > In house, some of our new users are struggling with parametrizing
>> > HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around
>> > the fact that they have to provide different configurations for
>> > checkpointing and for the Iceberg table storage (even if both are stored
>> > in the same bucket, or on the same HDFS cluster).
>> >
>> > I have created a PR which provides a FileIO implementation that uses the
>> > Flink FileSystem. Very imaginatively, I have named it FlinkFileIO. See:
>> > https://github.com/apache/iceberg/pull/10151
>> >
>> > This would allow users to configure the FileSystem only once and use it
>> > to access Iceberg tables as well. And if the global nature of the Flink
>> > file system configuration ever becomes limiting, users can still fall
>> > back to the other FileIO implementations.
>> >
>> > What do you think? Would this be a useful addition to the Iceberg-Flink
>> > integration?
>> >
>> > Thanks,
>> > Peter
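PS (for anyone skimming the thread): below is a rough, untested sketch of the kind of FileIO Peter describes, just to make the shape of the idea concrete. It delegates every operation to whatever Flink FileSystem is registered for the path's scheme, so the Flink-level file system configuration (and delegation tokens) is reused. The class name and structure are illustrative only; the actual implementation in PR #10151 is more complete and differs in details.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.FileSystem.WriteMode;
import org.apache.flink.core.fs.Path;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.PositionOutputStream;
import org.apache.iceberg.io.SeekableInputStream;

// Illustrative sketch only (not the code from PR #10151): a FileIO that delegates to
// Flink's FileSystem, so the file systems configured for Flink (checkpointing etc.)
// are reused for Iceberg data and metadata files.
public class FlinkFileSystemFileIO implements FileIO {

  // Look up the Flink FileSystem registered for the location's scheme (s3://, hdfs://, ...)
  private static FileSystem fs(String location) {
    try {
      return FileSystem.get(new Path(location).toUri());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public InputFile newInputFile(String location) {
    return new InputFile() {
      @Override public String location() { return location; }

      @Override public long getLength() {
        try { return fs(location).getFileStatus(new Path(location)).getLen(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
      }

      @Override public boolean exists() {
        try { return fs(location).exists(new Path(location)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
      }

      @Override public SeekableInputStream newStream() {
        try {
          FSDataInputStream in = fs(location).open(new Path(location));
          // Adapt Flink's seekable stream to Iceberg's SeekableInputStream contract
          return new SeekableInputStream() {
            @Override public int read() throws IOException { return in.read(); }
            @Override public long getPos() throws IOException { return in.getPos(); }
            @Override public void seek(long newPos) throws IOException { in.seek(newPos); }
            @Override public void close() throws IOException { in.close(); }
          };
        } catch (IOException e) { throw new UncheckedIOException(e); }
      }
    };
  }

  @Override
  public OutputFile newOutputFile(String location) {
    return new OutputFile() {
      @Override public String location() { return location; }
      @Override public PositionOutputStream create() { return stream(WriteMode.NO_OVERWRITE); }
      @Override public PositionOutputStream createOrOverwrite() { return stream(WriteMode.OVERWRITE); }
      @Override public InputFile toInputFile() { return FlinkFileSystemFileIO.this.newInputFile(location); }

      private PositionOutputStream stream(WriteMode mode) {
        try {
          FSDataOutputStream out = fs(location).create(new Path(location), mode);
          // Adapt Flink's output stream to Iceberg's PositionOutputStream contract
          return new PositionOutputStream() {
            @Override public void write(int b) throws IOException { out.write(b); }
            @Override public long getPos() throws IOException { return out.getPos(); }
            @Override public void close() throws IOException { out.close(); }
          };
        } catch (IOException e) { throw new UncheckedIOException(e); }
      }
    };
  }

  @Override
  public void deleteFile(String location) {
    try {
      fs(location).delete(new Path(location), false /* not recursive */);
    } catch (IOException e) { throw new UncheckedIOException(e); }
  }
}
```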