Hi everyone, I tried to consolidate S3 properties in this PR: https://github.com/apache/iceberg/pull/11321
Hopefully we can start building a single source of truth from it.

Thanks for your review,
-Hsiang

> On Aug 7, 2024, at 3:44 AM, Kevin Liu <kevin.jq....@gmail.com> wrote:
>
> +1 on standardizing, and possibly extending this to include catalog
> properties.
>
> On the PyIceberg side, a recent development is the ability to separate S3
> FileIO configurations from the Glue Catalog configurations, with an optional
> configuration to use the same for both if specified. See Unified AWS
> Credentials
> <https://py.iceberg.apache.org/configuration/#unified-aws-credentials> and
> GitHub Issue #892 <https://github.com/apache/iceberg-python/issues/892>
>
> So for AWS credentials, there are currently 3 different properties for
> `access-key-id`
> * `s3.access-key-id` (S3 FileIO specific)
> * `glue.access-key-id` (Glue Catalog specific)
> * `client.access-key-id` (Unified)
>
> Thanks,
> Kevin Liu
>
>
> On Wed, Jul 31, 2024 at 10:05 AM Xuanwo <xua...@apache.org
> <mailto:xua...@apache.org>> wrote:
>> Thank you all. I'm going to prepare a proposal PR for this.
>>
>> On Fri, Jul 12, 2024, at 10:06, Honah J. wrote:
>>> Hello everyone,
>>>
>>> Thank you all for the valuable insights. I am also +1 on having
>>> standardized names for File IO properties. Creating a dedicated section to
>>> summarize property names in the Java implementation is a good starting
>>> point. Since pyiceberg, icebergRust, and IcebergGolang will support only
>>> subsets of these properties for some time (with the rest to be added in
>>> future development), the existing Java implementation will serve as a
>>> useful reference. Additionally, we could establish general naming
>>> conventions in the doc, such as using the “s3.” prefix for S3 properties
>>> and hyphens to connect words.
>>>
>>> Best regards,
>>> Honah
>>>
>>> On Wed, Jul 10, 2024 at 10:47 AM <ndrl...@proton.me.invalid> wrote:
>>>
>>>
>>> I don't know what the recommended way to start standardizing is.
>>> We can start a proposal for each context or have one proposal to handle all.
>>>
>>> Suggested contexts to start with:
>>> Rest Catalog
>>> FileIO
>>>
>>> I believe that most of the other cases are supported by the configuration
>>> topic in the Table section[1], but this is about the Java implementation.
>>> Maybe we need to create a page in the project section[2] to handle the
>>> properties in the table section and the Rest and FileIO contexts.
>>>
>>>
>>> [1]: https://iceberg.apache.org/docs/latest/configuration/
>>> [2]: https://iceberg.apache.org/community/
>>> On Wednesday, July 10th, 2024 at 11:58 AM, Russell Spitzer
>>> <russell.spit...@gmail.com <mailto:russell.spit...@gmail.com>> wrote:
>>>> Sounds reasonable to me
>>>>
>>>> On Wed, Jul 10, 2024 at 9:28 AM Renjie Liu <liurenjie2...@gmail.com
>>>> <mailto:liurenjie2...@gmail.com>> wrote:
>>>> Hi:
>>>>
>>>> +1 for standardizing iceberg properties. This will help to align different
>>>> language implementations.
>>>>
>>>> On Wed, Jul 10, 2024 at 9:44 PM <ndrl...@proton.me.invalid> wrote:
>>>>
>>>> Hello Everyone,
>>>>
>>>> I was considering discussing the standardization of Iceberg properties,
>>>> and I believe this thread could be a great place to start.
>>>>
>>>> I'm writing an Iceberg client in Elixir and using the Java, Python, and
>>>> Rust implementations as references. However, I've had some difficulty
>>>> determining which configurations we must support and what each client has
>>>> implemented. Therefore, I agree with Xuanwo about having a separate
>>>> section as a single source of truth (SSOT).
>>>>
>>>> Additionally, I think it would be beneficial for each client to show what
>>>> it does not support. This would make it easier for users to know that a
>>>> particular client might not work with some configuration that their
>>>> catalog could define as default or override. It would also help us, as
>>>> contributors, to know which configurations we need to implement support
>>>> for.
>>>>
>>>> For example, the "s3.signer"[1] and "s3.proxy-uri"[2] configurations only
>>>> exist in the Python implementation. I believe it is not clear that these
>>>> configurations are exclusive to Python, and they might be configurations
>>>> that the catalog could override or define as defaults in the get info
>>>> endpoint. Without an SSOT, this could be harder to track.
>>>>
>>>> Another example is the "rest.authorization-url" in Python and Rust versus
>>>> "oauth2_server_uri" in Java. Although this is a bit out of scope for this
>>>> thread, I will open another discussion topic about broader standardization
>>>> of available properties.
>>>>
>>>>
>>>> [1]:
>>>> https://github.com/search?q=repo%3Aapache%2Ficeberg-python+s3.signer&type=code
>>>> [2]:
>>>> https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20S3_PROXY_URI&type=code
>>>>
>>>> On Wednesday, July 10th, 2024 at 7:51 AM, Fokko Driesprong
>>>> <fo...@apache.org <mailto:fo...@apache.org>> wrote:
>>>>> Hey Xuanwo,
>>>>>
>>>>> Thanks for raising this.
>>>>> The S3 properties are largely covered under the S3FileIO page:
>>>>> https://iceberg.apache.org/docs/nightly/aws/#s3-fileio. But it looks like
>>>>> some important ones are indeed missing. I've raised an issue here
>>>>> <https://github.com/apache/iceberg/issues/10674>.
>>>>> PyIceberg only supports a subset of the functionality, and therefore
>>>>> many properties are also missing there.
>>>>> For the REST Catalog, there is an open PR to add
>>>>> <https://github.com/apache/iceberg/pull/10576> the options for GCS and
>>>>> ADLS. It would be great to get some more eyes on there.
>>>>> That being said, I do think there is value in formalizing them. When
>>>>> adding configuration options to PyIceberg, I'll make sure to check out
>>>>> the Java implementation to ensure that we use the same property.
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> On Wed, Jul 10, 2024 at 09:22, Xuanwo <xua...@apache.org
>>>>> <mailto:xua...@apache.org>> wrote:
>>>>> Hello everyone
>>>>>
>>>>> I've been working on the iceberg-rust FileIO recently and have found it
>>>>> challenging to identify all the necessary IO properties we need to
>>>>> support.
>>>>>
>>>>> For instance, consider AWS S3. There are no documents specifying which
>>>>> properties are supported by S3.
>>>>>
>>>>> The only relevant documentation I could find includes:
>>>>>
>>>>> - Iceberg AWS Integrations[1]: Does not define `s3.access-key-id` or
>>>>> `s3.secret-access-key`.
>>>>> - Pyiceberg configuration[2]: Missing several S3-related properties.
>>>>> - Iceberg REST Catalog[3]: Does not cover all storage services.
>>>>>
>>>>> To gather this information, we must refer to the S3FileIO Java code[4].
>>>>>
>>>>> I propose adding a separate section for agreeing upon these properties.
>>>>> We could create a specification that outlines all IO properties with
>>>>> indications of whether they are required or optional, along with their
>>>>> expected behaviors. This would help ensure consistency across different
>>>>> implementations without any conflicts.
>>>>>
>>>>>
>>>>> [1]: https://iceberg.apache.org/docs/latest/aws/
>>>>> [2]: https://py.iceberg.apache.org/configuration/#s3
>>>>> [3]:
>>>>> https://github.com/apache/iceberg/blob/eee81c59199a54e749ea58dae070eb066d9a5f9e/open-api/rest-catalog-open-api.yaml#L2737
>>>>> [4]:
>>>>> https://github.com/apache/iceberg/blob/2b21020aedb63c26295005d150c05f0a5a5f0eb2/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java#L46
>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/
>> Xuanwo
>>
>> https://xuanwo.io/
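
For readers following the thread, here is a minimal sketch of how the three `access-key-id` properties Kevin lists might be resolved by a client. It is not taken from any Iceberg implementation: the function name `resolve_property` is hypothetical, and the precedence rule (a service-specific key such as `s3.access-key-id` wins over the unified `client.access-key-id`) is an assumption that should be checked against the PyIceberg configuration docs linked above.

```python
from typing import Mapping, Optional

# Assumption for illustration: the unified prefix is "client." and a
# service-specific property takes precedence over the unified one.
UNIFIED_PREFIX = "client."


def resolve_property(
    properties: Mapping[str, str], service: str, name: str
) -> Optional[str]:
    """Return `<service>.<name>` if set, else `client.<name>`, else None.

    `resolve_property` is a hypothetical helper, not an Iceberg API.
    """
    specific = properties.get(f"{service}.{name}")
    if specific is not None:
        return specific
    return properties.get(f"{UNIFIED_PREFIX}{name}")


# Example catalog properties: one unified key plus an S3-specific override.
props = {
    "client.access-key-id": "unified-key",
    "s3.access-key-id": "s3-only-key",
}

s3_key = resolve_property(props, "s3", "access-key-id")      # S3-specific value
glue_key = resolve_property(props, "glue", "access-key-id")  # falls back to unified
```

A single lookup rule like this is the kind of behavior a shared property specification could pin down, so that each language client resolves the same key the same way.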