Re: FlinkFileIO implementation

2024-04-28 Thread Ryan Blue
I agree that there are cases where different implementations can be better than others. But what I'm trying to say is that the contract for FileIO is that the engine must use the one provided by the Table API. The engine doesn't know what is configured correctly and can't swap them out. And catalog

Re: FlinkFileIO implementation

2024-04-25 Thread Péter Váry
I think that the FileIO API is very limited for a good reason. The goal is to allow a wide variety of implementations, and this is what we are seeing here. All of the implementations have their pros and cons, and the actual use-case defines the best one the user should use. For example, my first te

Re: FlinkFileIO implementation

2024-04-25 Thread Jean-Baptiste Onofré
Good point about the schemes. That's true, it would be more complicated. I agree we should discuss that further. Personally, I think it's not a bad idea to have the catalog as the "source" for FileIO, and let the engine/client deal with that. I think it's an engine/client responsibility (I remem

Re: FlinkFileIO implementation

2024-04-25 Thread Steven Wu
I agree with Dan that ResolvingFileIO solves a different problem (resolving the FileIO based on the storage scheme, like s3) and is probably not a good fit for what Peter is trying to do. I also mentioned some downsides of FlinkFileSystemFileIO in the PR. It doesn't support batch deletes or progressive uplo

Re: FlinkFileIO implementation

2024-04-25 Thread Daniel Weeks
JB, The ResolvingFileIO is a somewhat different issue and more complicated with a concept like FlinkFileIO because the schemes would overlap. The main issue here is around how Flink handles file system operations outside of the Iceberg space (e.g. checkpointing) and the confusion it causes for pe

Re: FlinkFileIO implementation

2024-04-25 Thread Jean-Baptiste Onofré
Hi Peter, On a similar topic, I created a PR to support custom schemes in ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe the FlinkIO can be a new scheme/extension in the ResolvingFileIO. While I agree that it would be interesting to have support for FlinkFileIO, I'm not sure it'

Re: FlinkFileIO implementation

2024-04-24 Thread Péter Váry
Hi Ryan, In my experience the different engines often have different ways to distribute the authentication information. Some engines use IAM roles to access S3, some engines use security tokens to authenticate, which means we already have different Catalog/FileIO configurations on engine level

Re: FlinkFileIO implementation

2024-04-23 Thread Ryan Blue
What I mean is that the FileIO implementation is determined by the catalog and the table, not by the engine. The engine should be able to use any valid FileIO implementation. On Tue, Apr 23, 2024 at 2:31 AM Péter Váry wrote: > Hi Ryan, > > The intended use of the *FlinkFileSystemIO* is to set it

Re: FlinkFileIO implementation

2024-04-23 Thread Péter Váry
Hi Ryan, The intended use of the *FlinkFileSystemIO* is to set it through the *Catalog*, like this: *Map props = new HashMap<>(3);* *props.put(CatalogProperties.WAREHOUSE_LOCATION, warehouse); props.put(CatalogProperties.URI, uri); props.put(CatalogProperties.FILE_IO_IMPL, Fli
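
For readers following along, the property map in the message above can be sketched roughly as below. This is only an illustration, not the code from the PR: the FileIO class name is cut off in the message, so `HYPOTHETICAL_FILE_IO_CLASS` is a made-up placeholder, and the plain string keys are assumed to match the values of `CatalogProperties.WAREHOUSE_LOCATION`, `CatalogProperties.URI` and `CatalogProperties.FILE_IO_IMPL`, so that the sketch compiles without Iceberg on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

final class CatalogPropsSketch {
    // Placeholder only: the real class name is truncated in the message above.
    static final String HYPOTHETICAL_FILE_IO_CLASS = "com.example.HypotheticalFlinkFileIO";

    // Builds the three catalog properties mentioned in the message.
    static Map<String, String> catalogProps(String warehouse, String uri) {
        Map<String, String> props = new HashMap<>(3);
        props.put("warehouse", warehouse);                // assumed value of CatalogProperties.WAREHOUSE_LOCATION
        props.put("uri", uri);                            // assumed value of CatalogProperties.URI
        props.put("io-impl", HYPOTHETICAL_FILE_IO_CLASS); // assumed value of CatalogProperties.FILE_IO_IMPL
        return props;
    }
}
```

The resulting map would then be passed to the catalog when it is initialized, which keeps the FileIO choice in the catalog configuration rather than in engine code.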

Re: FlinkFileIO implementation

2024-04-22 Thread Ryan Blue
I don't think introducing a Flink-specific FileIO is a good idea. The intent of the Java API is for a table to use the FileIO instance that is supplied by the table object. That puts the responsibility for supplying a correctly configured FileIO on the catalog, which is the right place to i

RE: FlinkFileIO implementation

2024-04-22 Thread Ferenc Csaky
Hi Peter, I am coming from the Flink side, but at Cloudera we also use Iceberg. Utilizing the Flink delegation token framework via the Iceberg Java API would be great. I think that simplifying the configuration for Flink-related cases also has value on its own, and could help to eliminate some c

FlinkFileIO implementation

2024-04-19 Thread Péter Váry
Hi Iceberg Team, Flink has its own FileSystem implementation. See: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/ . This FileSystem already has several implementations: - Hadoop - Azure - S3 - Google Cloud Storage - ... As a general rule