Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Walaa Eldin Moustafa
Thank you Eduard for sharing this version of the proposal. Looks simple, functional, and extensible. On Thu, Aug 15, 2024 at 1:10 PM Ryan Blue wrote: > I think I'm fine either way. I lean toward the simplicity of the strings > in the proposal but would not complain if we went with Yufei's sugges

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Walaa Eldin Moustafa
The option of using catalog identifiers in the state map still requires keeping lineage information in the view because REFRESH MV needs the latest fully expanded children (which could have changed from the set of children currently in the state map), without reparsing the view tree. Therefore, cat

Re: [VOTE] Release Apache Iceberg Rust 0.3.0 RC1

2024-08-15 Thread Christian Thiel
+1 (non-binding) From: Xuanwo Date: Wednesday, 14. August 2024 at 17:58 To: dev@iceberg.apache.org Subject: [VOTE] Release Apache Iceberg Rust 0.3.0 RC1 Hello, Apache Iceberg Rust Community, This is a call for a vote to release Apache Iceberg rust version 0.3.0. The tag to be voted on is 0.3.0

Re: [DISCUSS] Iceberg-rust based Ruby bindings

2024-08-15 Thread Zheng Hu
From my understanding, the most abstracted approach for implementing a multi-language SDK is: building another language SDK (Python, Ruby, Go, etc.) on top of the Iceberg-Rust SDK. In this case, we can make our community resources focus on the Rust native kernel, and all of the other language bi

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
+ dev@arrow Thanks for all the valuable suggestions! I am inclined to Micah's idea that Arrow might be a better host compared to Parquet. To give more context, I am taking the initiative to add the geometry type to both Parquet and ORC. I'd like to do the same thing for variant type in that varia

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Jingsong Li
Thanks all for your discussion. The Apache Paimon community is also considering support for this Variant type, without a doubt, we hope to maintain consistency with Iceberg. Not only the Paimon community, but also various computing engines need to adapt to this type, such as Flink and StarRocks.

Re: [DISCUSS] adoption of format version 3

2024-08-15 Thread Ryan Blue
Quick update: I just opened PR 10948 with some prep work for v3. The main change is that it makes the support requirements for unknown transforms clear: * Writers are not allowed to commit data using a partition spec that contains a field with an unkno

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Micah Kornfield
> > That's fair @Micah, so far all the discussions have been direct and off the > dev list. Would you like to make the request on the public Spark Dev list? > I would be glad to co-sign, I can also draft up a quick email if you don't > have time. I think once we come to consensus, if you have band

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
> > I think given the constraint that catalog lookup has to be by identifier > and not UUID, I'd prefer using identifier in the refresh state. If we use > identifiers, we can directly parallelize the catalog calls to fetch the > latest state. If we use UUID, the engine has to go back to the MV an

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
> > I do not think 3 and 4 are at odds with each other (for example > maintaining both lineage map and state map through UUID can achieve both). I agree, I should have been more clear that #5 (limiting new view versions) also comes into play. If UUID is used in lineage as part of the view spec,

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-15 Thread Sung Yun
Hi Daniel, thank you very much for testing the installation thoroughly and reporting these issues. We make note of the supported Python versions using the PyPI classifiers, but I agree t

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
I think given the constraint that catalog lookup has to be by identifier and not UUID, I'd prefer using identifier in the refresh state. If we use identifiers, we can directly parallelize the catalog calls to fetch the latest state. If we use UUID, the engine has to go back to the MV and possibly

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Micah Kornfield
> > I think the Parquet community is the most neutral option available. Would > anyone else support asking the Spark and Parquet communities to maintain > the variant spec in Parquet? This makes sense to me. I'll reiterate that Arrow might be a better potential home for this for a few different

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-15 Thread Daniel Weeks
I ran into a couple issues while trying to verify the release. The first appears to be a transient issue (we ran into something similar in the 0.6.1 release but I was able to install later). Package docutils (0.21.post1) not found. make: *** [install-dependencies] Error 1 The second issue is mor

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Walaa Eldin Moustafa
Thanks Jan, Micah, and Karuppayya for chiming in. I do not think 3 and 4 are at odds with each other (for example maintaining both lineage map and state map through UUID can achieve both). Also, I do not think we can drop the lineage map since in many catalogs, the only lookup method is by the cat

Re: Spark: Copy Table Action

2024-08-15 Thread Yufei Gu
Sorry for the late reply. > I was wondering if we also want to support the use case of moving tables in this proposal? Pucheng, yes, we could use the action to move tables. Hi Sumedh, here are my answers to your questions: > Should the copied table registered in same catalog as the source table

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Daniel Weeks
I would agree that Parquet seems like a reasonable option in terms of fit and neutrality. I'd love to get any feedback from others, but assuming there's general consensus, I feel like we need to engage with those communities and have an open conversation about the discussions we've had and why we

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread karuppayya
+1 to storing the refresh state as a map of UUIDs to snapshot IDs, and deferring the inclusion of lineage to a future iteration (like Micah mentioned). This would greatly simplify the current design. Also in terms of identifiers to use (UUID or catalog identifier) for the refresh state We will not b

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-08-15 Thread Ryan Blue
To follow up on the idea of multiple physical types for a shredded column, we had a discussion internally about this and I think it's pretty reasonable to add that later if we end up needing it. I agree that there's no pressing need to add that complication to the spec. On Wed, Aug 14, 2024 at 2:4

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Ryan Blue
I think I'm fine either way. I lean toward the simplicity of the strings in the proposal but would not complain if we went with Yufei's suggestion. On Thu, Aug 15, 2024 at 12:12 PM Yufei Gu wrote: > The current proposal lists endpoints as plain strings, and I still believe > we could make things

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Yufei Gu
The current proposal lists endpoints as plain strings, and I still believe we could make things a bit smoother by adding some structure to them. Here's the example if the previous one throws you off. *Before:* "GET /v1/{prefix}/namespaces","POST /v1/{prefix}/namespaces","GET /v1/{prefix}/namespac
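To make the "plain strings" form discussed above concrete, here is a minimal sketch of how a client might turn advertised endpoint strings into structured values. The `Endpoint` type and the `parse_endpoint` and `supports` helpers are hypothetical names chosen for illustration; they are not part of the proposal or of any Iceberg library, and the two sample strings are the ones quoted in the message.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    """Hypothetical structured form of an advertised endpoint string."""
    method: str
    path: str


def parse_endpoint(text: str) -> Endpoint:
    """Split a string such as 'GET /v1/{prefix}/namespaces' into method and path."""
    parts = text.split(None, 1)  # split on the first run of whitespace
    if len(parts) != 2 or not parts[1].startswith("/"):
        raise ValueError(f"not a '<method> <path>' endpoint string: {text!r}")
    return Endpoint(parts[0].upper(), parts[1].strip())


# Sample strings taken from the message; a real server would advertise its own list.
advertised = {
    parse_endpoint(s)
    for s in ["GET /v1/{prefix}/namespaces", "POST /v1/{prefix}/namespaces"]
}


def supports(method: str, path: str) -> bool:
    """Return True if the (method, path) pair is in the advertised set."""
    return Endpoint(method.upper(), path) in advertised
```

Keeping the wire format as a plain string while parsing it on the client side is one way to get both properties debated in the thread: a simple payload and a structured value to work with in code.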

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
I support that whole-heartedly. Parquet would be a great neutral location for the spec. On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue wrote: > I think it's a good idea to reach out to the Spark community and make sure > we are in agreement. Up until now I think we've been thinking more > abstractly

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Ryan Blue
I think it's a good idea to reach out to the Spark community and make sure we are in agreement. Up until now I think we've been thinking more abstractly about what makes sense but before we make any decision we should definitely collaborate with the other communities. I'd also like to suggest an a

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Jack Ye
> But I propose to use a trimmed openAPI's format directly. Looking at the example, this feels quite complicated to me. > For example, it is easier if we want to include operationID I don't think we need to consider accommodating both, since operationId is an alternative to " ". > or adding featu

Re: [DISCUSS] Iceberg 1.6.1 release

2024-08-15 Thread Piotr Findeisen
Hey Fokko, Given that Avro 1.11.4 Java release was "1-2 weeks" a week ago, it should be done or in progress by now :) It seems the discussion https://lists.apache.org/thread/yycy9bp21r4cgq68vk9d66bkqrb162tq stalled 5 days ago though. Should we restart it, or rather go ahead with the release and le

Re: Support row filter & column masking in REST spec

2024-08-15 Thread Yufei Gu
Sorry, I gave the wrong doc, here is the proposal to enable row filtering and column mask: https://docs.google.com/document/d/14nmuxxfzQsYo59o0Fbpb-pxOlzS6bVtduL8P8pwKZ6U/edit#heading=h.irh2zymohx17 Yufei On Thu, Aug 15, 2024 at 9:49 AM Yufei Gu wrote: > Hi Shoham, > > I think this would be a

Re: Support row filter & column masking in REST spec

2024-08-15 Thread Yufei Gu
Hi Shoham, I think this would be a part of the REST Scan APIs. Here is the proposal, https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h Yufei On Thu, Aug 15, 2024 at 9:28 AM Shoham Yamin wrote: > Hi what are you thinking about adding in

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Yufei Gu
+1 for the proposal. In terms of the format, the current solution is simple enough. But I propose using a trimmed version of the OpenAPI format directly. It won't add much cost as we can just take the minimum fields we want. But it opens a window to extend it in the future. For example, it is easier if we want

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Ryan Blue
I think it's more straightforward to use the format from the existing proposal. That's unambiguous and seems easier to understand to me, rather than needing to refer to the spec to find out the details. The `UpdateTable` vs `CommitTable` discrepancy is a good example, and we could have names that a

Support row filter & column masking in REST spec

2024-08-15 Thread Shoham Yamin
Hi, what do you think about adding an option to the REST catalog for getting a row filter expression for each table and a column mask expression for each column? That way every query engine will know how to apply column masks and row filters. Here is my issue regarding that: https://github.com/apache/

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
I think it might be worth restating perceived requirements and making sure there is alignment on them. If I am reading correctly, I think the following are perceived requirements: 1. An engine must be able to unambiguously detect that an underlying queried entity has changed or not via metadata to

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Russell Spitzer
I'm on board for this proposal. I was in the off-mail chats and I think this is probably our simplest approach going forward. On Thu, Aug 15, 2024 at 10:39 AM Dmitri Bourlatchkov wrote: > OpenAPI tool will WARN a lot if Operation IDs overlap. Generated code/html > may also look odd in case of ov

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Dmitri Bourlatchkov
OpenAPI tool will WARN a lot if Operation IDs overlap. Generated code/html may also look odd in case of overlaps. All-in-all, I think the best practice is to define unique Operation IDs up front. For Iceberg REST API, the yaml file is the API definition, so it should not be a problem to ensure th

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Xuanwo
From the iceberg-rust perspective, it could be extremely challenging to keep track of both the Spark and Iceberg specifications. Having a single source of truth would be much better. I believe this change will also benefit Delta Lake if they implement the same approach. Perhaps we can try co

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Eduard Tudenhöfner
Hey Jack, thanks for the feedback. I replied in the doc but I can reiterate my answer here too: The *path* is unique and required so that feels more appropriate than requiring to have an optional *operationId* in the OpenAPI spec. Additionally, using the path is more straight-forward when we intro

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Jack Ye
Hi Eduard, In general I agree with this proposal, thanks for putting this up! Just one question (which I also added in the design), what are the thoughts behind using " ", vs using the operationId defined in the OpenAPI? The operationId approach definitely looks much cleaner to me, but (1) in Ope

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
+1 on posting this discussion to dev@spark ML > I don't think there is anything that would stop us from moving to a joint project in the future My concern is that if we don't do this from day 1, we will never ever do this. Best, Gang On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer wrote: > T

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign; I can also draft up a quick email if you don't have time. On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield wrote: > I

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Micah Kornfield
> > I agree that it would be beneficial to make a sub-project, the main > problem is political and not logistic. I've been asking for movement from > other relative projects for a month and we simply haven't gotten anywhere. I just wanted to double check that these issues were brought directly to

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
If we go with either UUID or Table Identifier + VersionID/SnapshotId in the refresh state, then this list is fully expanded already. So, to validate the freshness of a materialization, the engine doesn't even need to look at the view lineage. IMO, the view lineage is nice to have but not a necess

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
@Gang Wu I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other related projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a join

Re: [VOTE] Release Apache Iceberg Rust 0.3.0 RC1

2024-08-15 Thread NOTME ZE
+1. Thanks @Xuanwo for raising this! On Wed, Aug 14, 2024 at 23:57, Xuanwo wrote: > Hello, Apache Iceberg Rust Community, > > This is a call for a vote to release Apache Iceberg rust version 0.3.0. > > The tag to be voted on is 0.3.0. > > The release candidate: > > https://dist.apache.org/repos/dist/dev/iceber

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Jan Kaul
Hi all, I would like to reemphasize the purpose of the refresh-state for materialized views. The purpose is to determine if the precomputed data is fresh, stale or invalid. For that the current snapshot-id of every table in the query tree has to be fetched from the catalog by using its full i
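The freshness check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the design from the thread: the refresh state is modeled as a plain mapping from table identifier to recorded snapshot ID, `current_snapshot_id` stands in for a catalog lookup, and treating a table that can no longer be found as "invalid" is an assumption made here, since the message is cut off before it defines the three states.

```python
from enum import Enum
from typing import Callable, Mapping, Optional


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"
    INVALID = "invalid"


def check_freshness(
    refresh_state: Mapping[str, int],
    current_snapshot_id: Callable[[str], Optional[int]],
) -> Freshness:
    """Compare recorded snapshot IDs against the catalog's current ones.

    refresh_state maps a table identifier to the snapshot ID recorded at
    the last refresh. current_snapshot_id is a hypothetical catalog lookup
    that returns None when the table cannot be found.
    """
    result = Freshness.FRESH
    for identifier, recorded in refresh_state.items():
        current = current_snapshot_id(identifier)
        if current is None:
            # Assumption: a missing source table makes the materialization invalid.
            return Freshness.INVALID
        if current != recorded:
            result = Freshness.STALE
    return result
```

Because each lookup depends only on its own identifier, the calls could be issued in parallel, which is the advantage of identifier-keyed state raised elsewhere in this thread.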