Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
Hi, Thanks Micah for clearly stating the requirements. I think this gives better clarity for the discussion. It seems like we don't have a solution that satisfies all requirements at once. So we would need to choose which has the fewest drawbacks. I would like to summarize the different dra

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
As the table I created is not properly shown in the mailing list I'll reformat the summary of the different drawbacks again: Drawbacks of (no lineage, refresh-state key = identifier): - introduces catalog identifiers into table metadata (#4) - query engine has to expand lineage at refresh time

Re: [DISCUSS] Changing namespace separator in REST spec

2024-08-16 Thread Eduard Tudenhöfner
I do want to remind us that the original issue was reported by users from the community before we were internally aware of this issue, meaning that there are users of the V1 API that are either running into this issue or will eventually run into this

[VOTE] Make namespace separator configurable in REST Spec

2024-08-16 Thread Eduard Tudenhöfner
Hey everyone, as I mentioned on the DISCUSS thread, this is providing a simple path forward for users of the V1 APIs (make the namespace separator *configurable* instead of *hardcoded*) that are either running into issue #10338 or will eventually wh
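For readers following the vote, here is a minimal client-side sketch of a separator that is configurable rather than hardcoded. It is not taken from the spec change itself: the function names are hypothetical, and the default of the 0x1F unit separator is an assumption about the current V1 behaviour.

```python
from urllib.parse import quote, unquote

# Assumption: the unit separator (0x1F) is the default used today.
DEFAULT_SEPARATOR = "\x1f"


def encode_namespace(levels, separator=DEFAULT_SEPARATOR):
    """Join namespace levels with the separator and percent-encode the result.

    Hypothetical helper: `separator` stands in for whatever value a server
    might advertise once the separator is configurable.
    """
    return quote(separator.join(levels), safe="")


def decode_namespace(encoded, separator=DEFAULT_SEPARATOR):
    """Reverse of encode_namespace: percent-decode, then split into levels."""
    return unquote(encoded).split(separator)
```

A client that reads the separator from configuration would pass it through to both helpers, so the same code path serves servers that keep the default and servers that override it.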

Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread Eduard Tudenhöfner
If we really want to add more structure to the JSON representation, then I would probably prefer what Dmitri suggested in the doc as I think { "GET": { }, "POST": {} } looks a bit weird: "endpoints":[ {"verb": "GET","path": "/v1/{prefix}/namespaces/{namespace}"}, {"verb": "GET","path": "/v1/{p

Re: Iceberg-arrow vectorized read bug

2024-08-16 Thread Eduard Tudenhöfner
Hey Steve, It's been a long time since I did some work on the iceberg-arrow module but I will try to find some time next week to analyze the problem in detail and see what options we have for fixing it. Thanks for your patience here. Eduard On Mon, Aug 12, 2024 at 9:00 PM Lessard, Steve wrote:

Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread Dmitri Bourlatchkov
Sorry for a bit of back-tracking, but I'd like to clarify the meaning of those endpoint strings. I initially assumed that the strings would need to be parsed into components (verb / path) for use at runtime. My suggestion for using a JSON representation was meant to make the parsing more standard
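To make the two options in this thread concrete, here is a rough sketch of both: parsing a plain endpoint string into verb and path, and reading the structured JSON form quoted earlier in the thread. The "VERB path" string layout with a single space is an assumption, and the `Endpoint` type and function names are hypothetical.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    verb: str
    path: str


def parse_endpoint_string(value: str) -> Endpoint:
    """Split an assumed '<VERB> <path>' string into its two components."""
    verb, sep, path = value.strip().partition(" ")
    path = path.strip()
    if not sep or not verb or not path.startswith("/"):
        raise ValueError(f"not a valid endpoint string: {value!r}")
    return Endpoint(verb.upper(), path)


def parse_endpoint_object(obj: dict) -> Endpoint:
    """Read the structured form, e.g. {"verb": "GET", "path": "/v1/..."}."""
    return Endpoint(obj["verb"].upper(), obj["path"])
```

Either form yields the same value. The difference is only where the parsing rule lives: in the client for plain strings, or in the JSON schema for the structured form.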

Re: [VOTE] Make namespace separator configurable in REST Spec

2024-08-16 Thread Dmitri Bourlatchkov
+1 (nb) to the spec change. Cheers, Dmitri. On Fri, Aug 16, 2024 at 4:31 AM Eduard Tudenhöfner wrote: > Hey everyone, > > as I mentioned on the DISCUSS thread, this is providing a simple path > forward for users of the V1 APIs (make the namespace separator > *configurable* instead of *hardcoded

Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread karuppayya
Can we consider adding `OPTIONS` verb to the resource paths, as part of the spec? That way the endpoint discovery endpoint could return only the list of supported endpoints, without the verbs. `OPTIONS` on the resource path can return the list of all supported verbs, and also other information reg

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Jan for the summary. For this point: > For a refresh operation the query engine has to parse the SQL and fully expand the lineage with its children anyway. So the lineage is not strictly required. If the lineage is provided at creation time by the respective engine, the refresh operatio

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Daniel Weeks
Thanks Sung! I agree with the comments that this doesn't require a new RC. +1 (binding) Verified sigs/sums/license/build/test with Python 3.11.9 Thanks, -Dan On Thu, Aug 15, 2024 at 3:34 PM Sung Yun wrote: > Hi Daniel, thank you very much for testing the installation thoroughly and > reporti

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Chinmay Bhat
+1 (non-binding) - Verified signatures, checksums, license - Ran unit tests & test-coverage with Python 3.9.19 Best, Chinmay On Fri, Aug 16, 2024 at 10:02 PM Daniel Weeks wrote: > Thanks Sung! > > I agree with the comments that this doesn't require a new RC. > > +1 (binding) > > Verified sigs/

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Gene Pang
Hi all, I am one of the main developers for Variant in Apache Spark. David Cashman (another one of the main Variant developers) and I have been working on Variant in Spark for a while, and we are excited by the interest from the Iceberg community! We have attended some of the Iceberg dev Variant

Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread Yufei Gu
I’m OK with using a plain string for the endpoint ID, as described in doc[1]. However, I’ve been thinking about how we can make this more flexible, especially since we’ve had quite a few discussions about granularity. For instance, if we expose a bit more of the API specification, we might not nee

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Weston Pace
+1 to using Arrow to house the spec. In the interest of expediency I wonder if we could even store it there "on the side" while we figure out how to integrate the variant data type with Arrow. I have a question for those more familiar with the variant spec. Do we think it could be introduced as

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
Hi Walaa, I would argue that for the refresh operation the query engine has to parse the query and then somehow execute it. For a full refresh it will directly execute the query and for an incremental refresh it will execute a modified version. Therefore it has to fully expand the query tree. Best wis

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Ryan Blue
I think Parquet is a better place for the variant spec than Arrow. Parquet is upstream of nearly every project (other than ORC) so it is a good place to standardize and facilitate discussions across communities. There are also existing relationships and connections to the Parquet community because

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Weston Pace
> Parquet is upstream of nearly every project (other than ORC) I disagree with this statement. There is a difference between being upstream and being the internal format in use. For example, datafusion, duckdb, ray, etc. all have parquet upstream but all of them use Arrow as the internal memory

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Gene Pang
I think Parquet might be a better home over Arrow. Ryan already brought up interesting points, especially with all of the storage related details and discussions, like shredding. Another aspect to this is that while working on Variant, we had ideas of adding a Variant logical type to Parquet. We t

Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

2024-08-16 Thread Lessard, Steve
Hi Eduard, Thank you for offering to help with this issue. If you are able to find some time to look at this issue I’d be sure to make some time to collaborate with you. I have been continuing to investigate the issue. I still do not have a correct solution, but I believe I have something clos

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Reynold Xin
My $0.02 (as an Apache Spark PMC member): It'd be very unfortunate if multiple variant specs emerge at the physical storage layer. The most important thing is interoperability at the physical storage layer, since that's by far the most expensive to "convert". Forking will inevitably lead to

Type promotion in v3

2024-08-16 Thread Ryan Blue
I’ve recently been working on updating the spec for new types and type promotion cases in v3. I was talking to Micah and he pointed out an issue with type promotion: the upper and lower bounds for data file columns that are kept in Avro manifests don’t have any information about the type that was

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Will Jones
In being more engine and format agnostic, I agree the Arrow project might be a good host for such a specification. It seems like we want to move away from hosting in Spark to make it engine agnostic. But moving into Iceberg might make it less format agnostic, as I understand multiple formats might

[DISCUSS] Row Lineage Proposal

2024-08-16 Thread Russell Spitzer
Hi Y'all, We've been working on a new proposal to add Row Lineage to Iceberg in the V3 Spec. The general idea is to give every row a unique identifier as well as a marker of what version of the row it is. This should let us build a variety of features related to CDC, Incremental Processing and Aud

Re: [VOTE] Merge REST spec clarification on how servers should handle unknown updates/requirements

2024-08-16 Thread Amogh Jahagirdar
The vote passes with 7 +1 binding votes and 1 +1 non-binding vote and I will merge the spec change. Thanks everyone for providing feedback and voting! Thanks, Amogh Jahagirdar On Wed, Aug 14, 2024 at 9:23 PM Renjie Liu wrote: > +1 > > On Thu, Aug 15, 2024 at 10:10 AM Jack Ye wrote: > >> +1 >> >> -

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
That is right. I agree that in the case of using catalog identifiers in state information, using them in lineage information would be a nice-to-have and not a requirement. However, this still does not address the semantic issue which is more fundamental in my opinion. The Iceberg table spec is not

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Micah Kornfield
> > However, this still does not address the semantic issue which is more > fundamental in my opinion. The Iceberg table spec is not aware of catalog > table identifiers and this use will be the first break of this abstraction. IIUC, based on Jan's comments, we are not going to modify the table s

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Micah, for the latter, I meant the type of denormalization of repeating a 3-part name as opposed to using an ID. On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield wrote: > However, this still does not address the semantic issue which is more >> fundamental in my opinion. The Iceberg table s

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Sung Yun
Hi folks! We are 1 binding vote short of accepting this release candidate. The verification steps are very easy to follow and can be found here: https://py.iceberg.apache.org/verify-release/ Thank you all again for testing and verifying the release! Sung On Fri, Aug 16, 2024 at 12:39 PM Chinmay

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Micah Kornfield
> > That being said, I think the most important consideration for now is where > are the current maintainers / contributors to the variant type. If most of > them are already PMC members / committers on a project, it becomes a bit > easier. Otherwise if there isn't much overlap with a project's exi

Re: [DISCUSS] Guidelines for committing PRs

2024-08-16 Thread Micah Kornfield
Hi Walaa, > For the former, we could talk about avoiding conflict of interest as a way > of "maintaining trust". For the latter, we can state some examples that > clearly reflect conflict of interest with no ambiguity. For example, a > committer merging a large change that received minimal discuss

Re: [DISCUSS] Guidelines for committing PRs

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Micah. It is clearer now. I have left some comments. Let us continue on the PR. On Fri, Aug 16, 2024 at 5:39 PM Micah Kornfield wrote: > Hi Walaa, > >> For the former, we could talk about avoiding conflict of interest as a >> way of "maintaining trust". For the latter, we can state some e

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Honah J.
+1 (binding) - Validated signatures/checksum/license - Ran test with Python 3.11.9 Sorry for being late. Thanks Sung for running the release! Thanks everyone for contributing and testing! Best regards, Honah On Fri, Aug 16, 2024 at 5:04 PM Sung Yun wrote: > Hi folks! > > We are 1 binding vote

Re: [VOTE] Release Apache Iceberg Rust 0.3.0 RC1

2024-08-16 Thread Renjie Liu
+1 (binding) [*] Download links are valid. [*] Checksums and signatures It seems we are missing a how-to-verify page, so I followed the instructions here: https://iceberg.apache.org/how-to-release/#validating-a-source-release-candidate [*] LICENSE/NOTICE files exist [*] No unexpected binary files [ ] All source fi

Re: [DISCUSS] Row Lineage Proposal

2024-08-16 Thread Péter Váry
Hi Russell, As discussed offline, this would be very hard to implement with the current Flink CDC write strategies. I think this is true for every streaming writer. For tracking the previous version of the row, the streaming writer would need to scan the table. It needs to be done for every reco