Hi everyone,

I tried to consolidate S3 properties in this PR: 
https://github.com/apache/iceberg/pull/11321 

Hopefully we can start building a single source of truth from it.

Thanks for your review,
-Hsiang

> On Aug 7, 2024, at 3:44 AM, Kevin Liu <kevin.jq....@gmail.com> wrote:
> 
> +1 on standardizing, and possibly extending this to include catalog 
> properties. 
> 
> On the PyIceberg side, a recent development is the ability to separate S3 
> FileIO configurations from the Glue Catalog configurations, with an optional 
> configuration to use the same for both if specified. See Unified AWS 
> Credentials 
> <https://py.iceberg.apache.org/configuration/#unified-aws-credentials> and 
> GitHub issue #892 <https://github.com/apache/iceberg-python/issues/892>.
> 
> So for AWS credentials, there are currently 3 different properties for 
> `access-key-id`
> * `s3.access-key-id` (S3 FileIO specific) 
> * `glue.access-key-id` (Glue Catalog specific)
> * `client.access-key-id` (Unified)
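
To make the lookup order concrete, here is a minimal sketch of how a client could resolve one of these properties. The helper name `resolve_property` is hypothetical, and the precedence (service-specific key first, unified `client.` key as fallback) is an assumption for illustration, not a description of any implementation's actual code.

```python
from typing import Mapping, Optional


def resolve_property(
    properties: Mapping[str, str], service: str, name: str
) -> Optional[str]:
    """Return the service-specific value if set, else the unified `client.` value."""
    # Assumption: a service-specific key such as `s3.access-key-id` takes
    # precedence over the unified `client.access-key-id` key.
    for key in (f"{service}.{name}", f"client.{name}"):
        value = properties.get(key)
        if value is not None:
            return value
    return None


# Example configuration: a unified key plus a Glue-specific override.
props = {
    "client.access-key-id": "unified-key",
    "glue.access-key-id": "glue-key",
}
s3_key = resolve_property(props, "s3", "access-key-id")      # falls back to client.*
glue_key = resolve_property(props, "glue", "access-key-id")  # uses glue.*
```

With this shape, adding another service prefix would not require any new lookup logic.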
> 
> Thanks,
> Kevin Liu
> 
> 
> 
> 
> On Wed, Jul 31, 2024 at 10:05 AM Xuanwo <xua...@apache.org 
> <mailto:xua...@apache.org>> wrote:
>> Thank you all. I'm going to prepare a proposal PR for this.
>> 
>> On Fri, Jul 12, 2024, at 10:06, Honah J. wrote:
>>> Hello everyone,
>>> 
>>> Thank you all for the valuable insights. I am also +1 on having 
>>> standardized names for File IO properties. Creating a dedicated section to 
>>> summarize property names in the Java implementation is a good starting 
>>> point. Since PyIceberg, iceberg-rust, and iceberg-go will support only
>>> subsets of these properties for some time (with the rest to be added in 
>>> future development), the existing Java implementation will serve as a 
>>> useful reference. Additionally, we could establish general naming 
>>> conventions in the doc, such as using the “s3.” prefix for S3 properties 
>>> and hyphens to connect words.
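
A naming convention like this could be checked mechanically. The sketch below is only one possible reading of it: the regular expression and the function name `follows_convention` are hypothetical, and the exact rules (lowercase only, dots between segments, hyphens between words) are assumptions rather than an agreed specification.

```python
import re

# Hypothetical pattern: a lowercase prefix such as "s3", then one or more
# dot-separated segments whose words are joined by hyphens.
PROPERTY_NAME = re.compile(r"^[a-z][a-z0-9]*(\.[a-z0-9]+(-[a-z0-9]+)*)+$")


def follows_convention(name: str, prefix: str = "s3") -> bool:
    """Check that `name` starts with `prefix.` and uses hyphen-joined lowercase words."""
    return name.startswith(prefix + ".") and PROPERTY_NAME.fullmatch(name) is not None


ok = follows_convention("s3.access-key-id")    # hyphens, "s3." prefix
bad = follows_convention("s3.access_key_id")   # underscores are rejected
```

A check of this kind could run in each implementation's test suite against its list of declared property names.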
>>> 
>>> Best regards,
>>> Honah
>>> 
>>> On Wed, Jul 10, 2024 at 10:47 AM <ndrl...@proton.me.invalid> wrote:
>>> 
>>> 
>>> I don't know what the recommended way to start standardizing is. We can 
>>> start a proposal for each context or have one proposal to handle all.
>>> 
>>> Suggested contexts to start with:
>>> Rest Catalog
>>> FileIO
>>> 
>>> I believe that most of the other cases are supported by the configuration 
>>> topic in the Table section[1], but this is about the Java implementation. 
>>> Maybe we need to create a page in the project section[2] to handle the 
>>> properties in the table section and the Rest and FileIO contexts.
>>> 
>>> 
>>> [1]: https://iceberg.apache.org/docs/latest/configuration/
>>> [2]: https://iceberg.apache.org/community/
>>> On Wednesday, July 10th, 2024 at 11:58 AM, Russell Spitzer 
>>> <russell.spit...@gmail.com <mailto:russell.spit...@gmail.com>> wrote:
>>>> Sounds reasonable to me
>>>> 
>>>> On Wed, Jul 10, 2024 at 9:28 AM Renjie Liu <liurenjie2...@gmail.com 
>>>> <mailto:liurenjie2...@gmail.com>> wrote:
>>>> Hi:
>>>> 
>>>> +1 for standardizing iceberg properties. This will help to align different 
>>>> language implementations.
>>>> 
>>>> On Wed, Jul 10, 2024 at 9:44 PM <ndrl...@proton.me.invalid> wrote:
>>>> 
>>>> Hello Everyone,
>>>> 
>>>> I was considering discussing the standardization of Iceberg properties, 
>>>> and I believe this thread could be a great place to start.
>>>> 
>>>> I'm writing an Iceberg client in Elixir and using the Java, Python, and 
>>>> Rust implementations as references. However, I've had some difficulty 
>>>> determining which configurations we must support and what each client has 
>>>> implemented. Therefore, I agree with Xuanwo about having a separate 
>>>> section as a single source of truth (SSOT).
>>>> 
>>>> Additionally, I think it would be beneficial for each client to show what 
>>>> it does not support. This would make it easier for users to know that a 
>>>> particular client might not work with some configuration that their 
>>>> catalog could define as default or override. It would also help us, as 
>>>> contributors, to know which configurations we need to implement support 
>>>> for.
>>>> 
>>>> For example, the "s3.signer"[1] and "s3.proxy-uri"[2] configurations only 
>>>> exist in the Python implementation. I believe it is not clear that these 
>>>> configurations are exclusive to Python, and they might be configurations 
>>>> that the catalog could override or define as defaults in the get info 
>>>> endpoint. Without an SSOT, this could be harder to track.
>>>> 
>>>> Another example is the "rest.authorization-url" in Python and Rust versus 
>>>> "oauth2_server_uri" in Java. Although this is a bit out of scope for this 
>>>> thread, I will open another discussion topic about broader standardization 
>>>> of available properties.
>>>> 
>>>> 
>>>> [1]: 
>>>> https://github.com/search?q=repo%3Aapache%2Ficeberg-python+s3.signer&type=code
>>>> [2]: 
>>>> https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20S3_PROXY_URI&type=code
>>>> 
>>>> On Wednesday, July 10th, 2024 at 7:51 AM, Fokko Driesprong 
>>>> <fo...@apache.org <mailto:fo...@apache.org>> wrote:
>>>>> Hey Xuanwo,
>>>>> 
>>>>> Thanks for raising this.
>>>>> The S3 properties are largely covered under the S3FileIO page: 
>>>>> https://iceberg.apache.org/docs/nightly/aws/#s3-fileio. But it looks like 
>>>>> some important ones are missing indeed. I've raised an issue here 
>>>>> <https://github.com/apache/iceberg/issues/10674>.
>>>>> PyIceberg supports only a subset of the functionality, so many 
>>>>> properties are missing there as well.
>>>>> For the REST Catalog, there is an open PR to add 
>>>>> <https://github.com/apache/iceberg/pull/10576> the options for GCS and 
>>>>> ADLS. It would be great to get some more eyes on it.
>>>>> That being said, I do think there is value in formalizing them. When 
>>>>> adding configuration options to PyIceberg, I'll make sure to check out 
>>>>> the Java implementation to ensure that we use the same property.
>>>>> 
>>>>> Kind regards,
>>>>> Fokko
>>>>> 
>>>>> Op wo 10 jul 2024 om 09:22 schreef Xuanwo <xua...@apache.org 
>>>>> <mailto:xua...@apache.org>>:
>>>>> Hello everyone
>>>>> 
>>>>> I've been working on the iceberg-rust FileIO recently and have found it 
>>>>> challenging to identify all the necessary IO properties we need to 
>>>>> support.
>>>>> 
>>>>> For instance, consider AWS S3. There are no documents specifying which 
>>>>> properties are supported by S3.
>>>>> 
>>>>> The only relevant documentation I could find includes:
>>>>> 
>>>>> - Iceberg AWS Integrations[1]: Does not define `s3.access-key-id` or 
>>>>> `s3.secret-access-key`.
>>>>> - PyIceberg configuration[2]: Missing several S3-related properties.
>>>>> - Iceberg REST Catalog[3]: Does not cover all storage services.
>>>>> 
>>>>> To gather this information, we must refer to the S3FileIO Java code[4].
>>>>> 
>>>>> I propose adding a separate section for agreeing upon these properties. 
>>>>> We could create a specification that outlines all IO properties with 
>>>>> indications of whether they are required or optional, along with their 
>>>>> expected behaviors. This would help ensure consistency across different 
>>>>> implementations without any conflicts.
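
One way to picture such a specification is as a machine-readable table that every implementation can check its configuration against. The sketch below is illustrative only: `PropertySpec`, `S3_SPEC`, and `check` are hypothetical names, and the required/optional flags and descriptions are placeholders, not agreed values.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class PropertySpec:
    name: str          # property key, e.g. "s3.access-key-id"
    required: bool     # whether a client must be given this property
    description: str   # expected behavior, in prose


# Illustrative entries only; the flags and descriptions are placeholders.
S3_SPEC = [
    PropertySpec("s3.access-key-id", False, "Access key ID used for S3 requests"),
    PropertySpec("s3.secret-access-key", False, "Secret access key used for S3 requests"),
]


def check(
    properties: Mapping[str, str], spec: Sequence[PropertySpec]
) -> Tuple[List[str], List[str]]:
    """Return (missing required names, names not present in the spec)."""
    known = {p.name for p in spec}
    missing = sorted(p.name for p in spec if p.required and p.name not in properties)
    unknown = sorted(k for k in properties if k not in known)
    return missing, unknown


missing, unknown = check({"s3.access-key-id": "key", "s3.signer": "custom"}, S3_SPEC)
```

Reporting unknown keys separately would also let a client state which properties it does not support.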
>>>>> 
>>>>> 
>>>>> [1]: https://iceberg.apache.org/docs/latest/aws/
>>>>> [2]: https://py.iceberg.apache.org/configuration/#s3
>>>>> [3]: 
>>>>> https://github.com/apache/iceberg/blob/eee81c59199a54e749ea58dae070eb066d9a5f9e/open-api/rest-catalog-open-api.yaml#L2737
>>>>> [4]: 
>>>>> https://github.com/apache/iceberg/blob/2b21020aedb63c26295005d150c05f0a5a5f0eb2/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java#L46
>>>>> 
>>>>> Xuanwo
>>>>> 
>>>>> https://xuanwo.io/
>> Xuanwo
>> 
>> https://xuanwo.io/
>> 
