Re: FlinkFileIO implementation

Steven Wu Thu, 25 Apr 2024 10:51:06 -0700

I agree with Dan that ResolvingFileIO solves a different problem: it resolves
the FileIO based on the storage scheme (like s3). It is probably not a good fit
for what Peter is trying to do.


I also mentioned some downsides of FlinkFileSystemFileIO in the PR. Unlike
S3FileIO, it doesn't support batch deletes or progressive upload. These
gaps also exist in HadoopFileIO, but we probably wouldn't recommend
HadoopFileIO for S3 storage.

Hence, I am also in favor of more discussion.
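
To make the batch-delete gap mentioned above concrete, here is a minimal
sketch. All type names in it (SimpleFileIO, BulkFileIO, Deleter and the two
counting classes) are hypothetical stand-ins written for illustration; they
are not the Iceberg or Flink API. It only shows why a file IO that lacks a
batch call costs one request per path, while one that has it can be served
with a single call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: a file IO that can only delete one path at a time.
interface SimpleFileIO {
  void deleteFile(String path);
}

// Hypothetical stand-in: a file IO that additionally offers a batch delete.
interface BulkFileIO extends SimpleFileIO {
  void deleteFiles(Iterable<String> paths);
}

final class Deleter {
  private Deleter() {}

  // Use the batch call when the implementation offers it,
  // otherwise fall back to one delete call per path.
  static void deleteAll(SimpleFileIO io, List<String> paths) {
    if (io instanceof BulkFileIO) {
      ((BulkFileIO) io).deleteFiles(paths);
    } else {
      for (String path : paths) {
        io.deleteFile(path);
      }
    }
  }
}

// In-memory fake that counts how many calls ("requests") it received.
final class CountingSimpleIO implements SimpleFileIO {
  int requests = 0;
  final List<String> deleted = new ArrayList<>();

  @Override
  public void deleteFile(String path) {
    requests++;
    deleted.add(path);
  }
}

// In-memory fake with a batch call: one request regardless of path count.
final class CountingBulkIO implements BulkFileIO {
  int requests = 0;
  final List<String> deleted = new ArrayList<>();

  @Override
  public void deleteFile(String path) {
    requests++;
    deleted.add(path);
  }

  @Override
  public void deleteFiles(Iterable<String> paths) {
    requests++;
    for (String path : paths) {
      deleted.add(path);
    }
  }
}
```

With three paths, the fake without a batch call records three requests and
the one with a batch call records a single request, which is the difference
that matters against an object store.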

On Thu, Apr 25, 2024 at 10:41 AM Daniel Weeks <[email protected]>
wrote:

> JB,
>
> ResolvingFileIO is a somewhat different issue, and it gets more complicated
> with a concept like FlinkFileIO because the schemes would overlap.
>
> The main issue here is around how Flink handles file system operations
> outside of the Iceberg space (e.g. checkpointing) and the confusion it
> causes for people setting up Flink.
>
> I'm concerned that the FlinkFileIO approach will ultimately just push that
> problem to the client side, since much of the FileIO configuration for a
> table will come from the catalog (like Ryan pointed out).
>
> We need to discuss this a little more and see if there's a way to preserve
> catalog/table managed configuration along with simplifying the config for
> users.
>
> -Dan
>
> On Thu, Apr 25, 2024 at 9:48 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi Peter,
>>
>> On a similar topic, I created a PR to support custom schemes in
>> ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe
>> FlinkFileIO can be a new scheme/extension in ResolvingFileIO.
>>
>> While I agree that it would be interesting to have support for
>> FlinkFileIO, I'm not sure it's a good idea to have it directly in
>> Iceberg. I think it would be great to leverage the extension mechanism
>> we have in Iceberg (FileIO/ResolvingFileIO).
>> Iceberg Core should not include engine-specific dependencies, imho.
>> However, having a "flink:" scheme in ResolvingFileIO where we can
>> leverage FlinkFileIO could be interesting.
>>
>> Just thinking out loud :)
>>
>> Regards
>> JB
>>
>> On Fri, Apr 19, 2024 at 12:08 PM Péter Váry <[email protected]>
>> wrote:
>> >
>> > Hi Iceberg Team,
>> >
>> > Flink has its own FileSystem implementation. See:
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
>> .
>> > This FileSystem already has several implementations:
>> >
>> > Hadoop
>> > Azure
>> > S3
>> > Google Cloud Storage
>> > ...
>> >
>> > As a general rule in Flink, one should use this FileSystem to consume
>> and persistently store data.
>> > If these FileSystems are configured, then Flink makes sure that the
>> configurations are consistent and available for the JM/TM.
>> > Also as an added benefit, delegation tokens are handled and distributed
>> for these FileSystems automatically. See:
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
>> >
>> > In house, some of our new users are struggling with configuring
>> HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around
>> the fact that they have to provide different configurations for checkpointing
>> and for the Iceberg table storage (even if both are stored in the same
>> bucket, or on the same HDFS cluster).
>> >
>> > I have created a PR, which provides a FileIO implementation which uses
>> FlinkFileSystem. Very imaginatively I have named it FlinkFileIO. See:
>> https://github.com/apache/iceberg/pull/10151
>> >
>> > This would allow users to configure the FileSystem only once, and
>> use this FileSystem to access Iceberg tables. Also, if for whatever reason
>> the global nature of the Flink FileSystem config is limiting, users could
>> still fall back to the other FileIO implementations.
>> >
>> > What do you think? Would this be a useful addition to the Iceberg-Flink
>> integration?
>> >
>> > Thanks,
>> > Peter
>>
>
