I agree that there are cases where different implementations can be better
than others. But what I'm trying to say is that the contract for FileIO is
that the engine must use the one provided by the Table API. The engine
doesn't know what is configured correctly and can't swap them out. And
catalog…
I think that the FileIO API is very limited for a good reason. The goal is
to allow a wide variety of implementations, and this is what we are seeing
here.
All of the implementations have their pros and cons, and the actual
use case determines which one is best for the user.
For example, my first te…
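For context, the surface area really is tiny; paraphrased from
org.apache.iceberg.io.FileIO (the real interface also has defaults such as
initialize() and close()), it is essentially just:

import java.io.Serializable;

// Paraphrased sketch, not the verbatim interface. InputFile and OutputFile
// are the org.apache.iceberg.io read/write handle types.
public interface FileIO extends Serializable {
  InputFile newInputFile(String path);   // open a file for reading
  OutputFile newOutputFile(String path); // create a new file for writing
  void deleteFile(String path);          // remove a single file
}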
Good point about the schemes. That's true, it would be more complicated.
I agree we should discuss that further. Personally, I think it's not
a bad idea to have the catalog as the "source" for FileIO, and let the
engine/client deal with that.
I think it's an engine/client responsibility (I remem…
Agree with Dan that ResolvingFileIO solves a different problem, resolving
FileIO based on the storage scheme (like s3), and is probably not a good
fit for what Peter is trying to do.
I also mentioned some downsides of FlinkFileSystemFileIO in the PR. It
doesn't support batch deletes or progressive uploads…
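For anyone unfamiliar, "batch deletes" refers to the optional bulk-delete
hook that implementations such as S3FileIO expose through
SupportsBulkOperations; a minimal sketch of how a caller can use it:

import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsBulkOperations;

class DeleteHelper {
  // Uses the bulk hook when the implementation offers one (e.g. a single
  // S3 DeleteObjects round trip), otherwise falls back to one-by-one deletes.
  static void deleteAll(FileIO io, Iterable<String> paths) {
    if (io instanceof SupportsBulkOperations) {
      ((SupportsBulkOperations) io).deleteFiles(paths);
    } else {
      for (String path : paths) {
        io.deleteFile(path);
      }
    }
  }
}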
JB,
ResolvingFileIO is a somewhat different issue, and combining it with a
concept like FlinkFileIO would be more complicated because the schemes
would overlap.
The main issue here is how Flink handles file system operations outside
of the Iceberg space (e.g. checkpointing) and the confusion it causes
for people…
Hi Peter,
On a similar topic, I created a PR to support custom schemes in
ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe
the FlinkIO can be a new scheme/extension in ResolvingFileIO.
While I agree that it would be interesting to have support for
FlinkFileIO, I'm not sure it'…
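As background for the thread, ResolvingFileIO is selected today like any
other implementation, and then dispatches per path on the URI scheme (the
custom mappings are what the PR would add); a sketch in the same style as
the snippet later in this thread:

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;

Map<String, String> props = new HashMap<>();
// ResolvingFileIO picks the implementation per path from the URI scheme,
// e.g. s3:// -> S3FileIO and gs:// -> GCSFileIO, falling back to
// HadoopFileIO for everything else.
props.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.io.ResolvingFileIO");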
Hi Ryan,
In my experience the different engines often have different ways to
distribute the authentication information. Some engines use IAM roles to
access S3, some engines use security tokens to authenticate, which means
we already have different Catalog/FileIO configurations at the engine level.
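To make that concrete, with S3FileIO the same table can be configured per
engine with different credentials, say short-lived STS tokens on one engine
and instance-profile auth (no keys at all) on another, using the standard
s3.* properties:

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;

Map<String, String> props = new HashMap<>();
props.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO");
// Engine A: short-lived STS credentials pushed in by the engine.
props.put("s3.access-key-id", accessKeyId);
props.put("s3.secret-access-key", secretAccessKey);
props.put("s3.session-token", sessionToken);
// Engine B would omit the three keys above entirely and rely on its IAM role.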
What I mean is that the FileIO implementation is determined by the catalog
and the table, not by the engine. The engine should be able to use any
valid FileIO implementation.
Hi Ryan,
The intended use of FlinkFileSystemIO is to set it through the Catalog,
like this:
Map<String, String> props = new HashMap<>(3);
props.put(CatalogProperties.WAREHOUSE_LOCATION, warehouse);
props.put(CatalogProperties.URI, uri);
props.put(CatalogProperties.FILE_IO_IMPL, Fli…
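Filling in the truncated part, a complete version might look like the
following; the FileIO class name and package are assumed from the proposal
in this thread, and most catalog implementations honor FILE_IO_IMPL in
initialize():

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

Map<String, String> props = new HashMap<>(3);
props.put(CatalogProperties.WAREHOUSE_LOCATION, warehouse);
props.put(CatalogProperties.URI, uri);
// Hypothetical class name, assumed from the proposal in this thread.
props.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.flink.FlinkFileSystemFileIO");

// catalogImpl is whichever catalog class is in use (Hive, JDBC, REST, ...).
Catalog catalog = CatalogUtil.loadCatalog(catalogImpl, "my_catalog", props, hadoopConf);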
I don't think introducing a Flink-specific FileIO is a good idea.
The intent of the Java API is for a table to use the FileIO instance that
is supplied by the table object. That puts the responsibility for supplying
a correctly configured FileIO on the catalog, which is the right place to
i…
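In code, that contract looks like this; a minimal sketch (the identifiers
and the path are made up for illustration):

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;

Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
// Engines read and write through whatever FileIO the catalog configured
// for this table; they do not construct or swap in their own implementation.
FileIO io = table.io();
InputFile dataFile = io.newInputFile("s3://bucket/db/tbl/data/file.parquet");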
Hi Peter,
I am coming from the Flink side, but at Cloudera we use Iceberg as well.
Utilizing the Flink delegation token framework via the Iceberg Java API
would be great. I think that simplifying the configuration for
Flink related cases also has value on its own, and could help to
eliminate some c…
Hi Iceberg Team,
Flink has its own FileSystem implementation, see:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
This FileSystem already has several implementations:
- Hadoop
- Azure
- S3
- Google Cloud Storage
- ...
As a general rule…
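For readers who haven't opened the PR, the shape of the proposal is
essentially an adapter from Iceberg's FileIO to Flink's FileSystem. A
condensed, hypothetical sketch: the class name is taken from this thread,
the details are mine, and the write path is omitted.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.SeekableInputStream;

public class FlinkFileSystemFileIO implements FileIO {
  @Override
  public InputFile newInputFile(String path) {
    return new FlinkInputFile(path);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    // Write side (PositionOutputStream over Flink's FSDataOutputStream) is
    // omitted to keep the sketch short.
    throw new UnsupportedOperationException("write path omitted in this sketch");
  }

  @Override
  public void deleteFile(String path) {
    try {
      flinkFs(path).delete(new Path(path), false);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to delete " + path, e);
    }
  }

  // Flink resolves the URI scheme (s3://, gs://, abfs://, ...) to one of
  // its pluggable FileSystem implementations, credentials included.
  private static FileSystem flinkFs(String path) {
    try {
      return FileSystem.get(URI.create(path));
    } catch (IOException e) {
      throw new UncheckedIOException("No Flink FileSystem for " + path, e);
    }
  }

  private static class FlinkInputFile implements InputFile {
    private final String location;

    FlinkInputFile(String location) {
      this.location = location;
    }

    @Override
    public long getLength() {
      try {
        return flinkFs(location).getFileStatus(new Path(location)).getLen();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    @Override
    public SeekableInputStream newStream() {
      try {
        FSDataInputStream in = flinkFs(location).open(new Path(location));
        // Adapt Flink's seekable stream to Iceberg's SeekableInputStream.
        return new SeekableInputStream() {
          @Override public long getPos() throws IOException { return in.getPos(); }
          @Override public void seek(long newPos) throws IOException { in.seek(newPos); }
          @Override public int read() throws IOException { return in.read(); }
          @Override public void close() throws IOException { in.close(); }
        };
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    @Override
    public String location() {
      return location;
    }

    @Override
    public boolean exists() {
      try {
        return flinkFs(location).exists(new Path(location));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }
}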