Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul

Hi,

Thanks Micah for clearly stating the requirements. I think this gives 
better clarity for the discussion.


It seems like we don't have a solution that satisfies all requirements 
at once. So we would need to choose which has the fewest drawbacks.


I would like to summarize the different drawbacks that came up in the 
discussion.


Drawbacks of (no lineage, refresh-state key = identifier):
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):
- update of materialized view version required if child view is updated (#5)


With identifiers as the refresh-state keys, the lineage is not strictly 
required and becomes an orthogonal proposal. That's why I left it out of 
the comparison.


In my opinion, introducing catalog identifiers into the table metadata 
(requirement #4) is the least significant drawback, as it is not a 
technical issue but rather one of semantics. This is especially true as the identifiers 
are not introduced into the table spec but are rather stored in the 
snapshot summary. That's why I'm in favor of using the catalog 
identifiers as the refresh-state keys.
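
To make this concrete, here is a rough sketch of what an identifier-keyed 
refresh-state map carried in the storage table's snapshot summary could look 
like. The identifiers, field names and placement are purely illustrative and 
not part of the proposal:

{
  "prod.db.orders":    { "ref": "main", "snapshot-id": 7391 },
  "prod.db.customers": { "ref": "main", "snapshot-id": 1204 }
}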


Regarding your last point Walaa:

The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree.


For a refresh operation the query engine has to parse the SQL and fully 
expand the lineage with its children anyway. So the lineage is not 
strictly required.


If I understand correctly, most of you are also in favor of using 
catalog identifiers + ref as the refresh-state keys and postponing the 
lineage proposal.


I hope that we can move the discussion forward.

Jan

On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree. Therefore, using catalog identifiers in the 
state map does not eliminate the need for tracking children in the 
form of catalog identifiers on the lineage side (but in this case 
lineage will be a set instead of just a map).


Hence, my concerns with using catalog identifiers (as opposed to 
UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the 
view spec (because such catalog identifiers are not defined in the 
table spec and the only place they have a meaning is in the view spec 
lineage information).
* (less fundamental) The denormalization introduced by this 
arrangement, where each identifier is 3-parts and all of them repeat 
in both lineage info and state map.


I am not very concerned with recursive expansion (through multiple 
calls), as it is always the case with views.


On a positive note, looks like we agree to move past sequence numbers :)

Thanks,
Walaa.




On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
 wrote:


I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state.  If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state. If we
use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map. It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).


I'm +1 on using catalog identifiers as the key.  As you point out
this is in line with #2 (try to minimize serial catalog lookups) in
addition to supporting requirement #3.
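
Purely as an illustration of the parallel lookup described above (this is
not a proposed API, just a sketch on top of the existing Catalog and Table
interfaces, with the state map assumed to be keyed by identifier strings):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class ParallelFreshnessCheck {
  // With identifier keys, every source table can be loaded concurrently.
  // With UUID keys, the engine would first have to resolve each UUID back to
  // an identifier via lineage before it could issue these catalog calls.
  static Map<String, Long> currentSnapshotIds(Catalog catalog, Iterable<String> identifiers) {
    List<CompletableFuture<Map.Entry<String, Long>>> futures = new ArrayList<>();
    for (String id : identifiers) {
      futures.add(CompletableFuture.supplyAsync(() -> {
        // Sketch only: assumes the table exists and has at least one snapshot.
        Table table = catalog.loadTable(TableIdentifier.parse(id));
        return Map.entry(id, table.currentSnapshot().snapshotId());
      }));
    }
    return futures.stream()
        .map(CompletableFuture::join)
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }
}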

On Thu, Aug 15, 2024 at 3:27 PM Benny Chow  wrote:

I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state.  If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state.  If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map.  It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).

Thanks
Benny


On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa
 wrote:

Thanks Jan, Micah, and Karuppayya for chiming in.

I do not think 3 and 4 are at odds with e

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
As the table I created is not properly shown in the mailing list I'll 
reformat the summary of the different drawbacks again:


Drawbacks of (no lineage, refresh-state key = identifier):

- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):

- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):

- update of materialized view version required if child view is updated (#5)
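
For illustration only (names and shapes below are hypothetical, not the
proposed spec), the difference between the key choices is roughly:

refresh-state keyed by uuid (needs a lineage map to resolve the keys):
  { "3051d02c-0d3a-4a1b-9d3e-000000000001": { "snapshot-id": 7391 } }
  lineage: { "3051d02c-0d3a-4a1b-9d3e-000000000001": "prod.db.orders" }

refresh-state keyed by catalog identifier + ref (self-describing):
  { "prod.db.orders@main": { "snapshot-id": 7391 } }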

On 16.08.24 09:17, Jan Kaul wrote:


Hi,

Thanks Micah for clearly stating the requirements. I think this gives 
better clarity for the discussion.


It seems like we don't have a solution that satisfies all requirements 
at once. So we would need to choose which has the fewest drawbacks.


I would like to summarize the different drawbacks that came up in the 
discussion.


Drawbacks of (no lineage, refresh-state key = identifier):
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):
- update of materialized view version required if child view is updated (#5)


With identifiers as the refresh-state keys, the lineage is not 
strictly required and becomes an orthogonal proposal. That's why I 
left it out of the comparison.


In my opinion introducing catalog identifiers into the table metadata 
(requirement #4) is the least significant drawback as it is not a 
technical reason but more about semantics. Especially as the 
identifiers are not introduced into the table spec but are rather 
stored in the snapshot summary. That's why I'm in favor of using the 
catalog identifiers as the refresh-state keys.


Regarding your last point Walaa:

The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree.


For a refresh operation the query engine has to parse the SQL and 
fully expand the lineage with its children anyway. So the lineage is 
not strictly required.


If I understand correctly, most of you are also in favor of using 
catalog identifiers + ref as the refresh-state keys and postponing the 
lineage proposal.


I hope that we can move the discussion forward.

Jan

On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree. Therefore, using catalog identifiers in 
the state map does not eliminate the need for tracking children in 
the form of catalog identifiers on the lineage side (but in this case 
lineage will be a set instead of just a map).


Hence, my concerns with using catalog identifiers (as opposed to 
UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the 
view spec (because such catalog identifiers are not defined in the 
table spec and the only place they have a meaning is in the view spec 
lineage information).
* (less fundamental) The denormalization introduced by this 
arrangement, where each identifier is 3-parts and all of them repeat 
in both lineage info and state map.


I am not very concerned with recursive expansion (through multiple 
calls), as it is always the case with views.


On a positive note, looks like we agree to move past sequence numbers :)

Thanks,
Walaa.




On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
 wrote:


I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state.  If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state.  If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map.  It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).


I'm +1 on using catalog identifiers as the key.  As you point out
this is in line with #2 (try to minimize serial catalog lookups)
in addition to supporting requirement #3.

On Thu, Aug 15, 2024 at 3:27 PM Benny Chow  wrote:

I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer 

Re: [DISCUSS] Changing namespace separator in REST spec

2024-08-16 Thread Eduard Tudenhöfner
I do want to remind us that the original issue was reported by users from
the community before we were internally aware of it, meaning that there are
users of the V1 API that are either running into this issue or will
eventually run into it when they upgrade their server stack.

For the users of the V1 API we have a simple path forward (make *configurable*
what's currently *hardcoded*), so I do not see a reason why
we shouldn't proceed and fix this for these users. I do not think that we
should wait to fix this problem until we have V2 APIs.

The configuration part is *entirely optional* for REST server implementers
and there's no behavioral change for existing installations.
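
To illustrate (the second form is hypothetical and depends on what a given
server advertises, it is not prescribed by the spec change): a two-level
namespace such as ("accounting", "tax") is addressed today with the hardcoded
unit separator, and would stay that way unless a server opts in to a
different separator:

GET /v1/{prefix}/namespaces/accounting%1Ftax     (default, unchanged)
GET /v1/{prefix}/namespaces/accounting.tax       (server configured "." as separator)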

Also it seems we have general consensus on the approach from at least 6
people (me included).

In the meantime the Iceberg community can focus on how a V2 API would look
and what potential escaping mechanisms would be necessary to solve the
general problem.

Thanks everyone and I'll go ahead and start a VOTE thread to update the
wording in the REST spec in #10877
.

Eduard


On Wed, Aug 14, 2024 at 6:46 PM Robert Stupp  wrote:

> I do object to the plan to make that separation character configurable,
> because there is no need to do this (see below), so there's no need for any
> code change. Instead, a proper, mostly human-readable escaping
> mechanism, which is possible, should be implemented with Iceberg REST v2.
>
> Here's the relevant part of the Servlet Spec v6 - first paragraph of [1]:
> "Servlet containers may implement the standard Servlet URI path
> canonicalization in any manner they see fit as long as the end result is
> identical to the end result of the process described here. Servlet
> containers may provide container specific configuration options to vary the
> standard canonicalization process. Any such variations may have security
> implications and both Servlet container implementors and users are advised
> to be sure that they understand the implications of any such container
> specific canonicalization options.
>
> There's no mention of "must" - the servlet spec carefully uses "may". Just
> 2-3 letters, but a huge difference, which does *not* mean that e.g. '%1F'
> has to be rejected, hence Jetty allows users to disable this behavior.
>
> I think, I mentioned my concerns already, but here's a summary:
>
> * There is no spec that forces this change.
>
> * The proposed change only tackles ASCII unit separator. The proposed
> change does NOT tackle other characters like '%' or '/' or '\'.
>
> * The proposed change adds yet another configuration option - there are
> already quite a lot. Getting that config wrong (and humans make mistakes,
> because they are not always aware of the consequences) leads to all kinds
> of issues.
>
> * Technical: The proposed change has no validation that the "separator" is
> a single character.
>
> * Technical: Allowing clients to configure such separator character opens
> the door for inconsistency due to protocol level discrepancies.
>
> * Using servlet spec 6 with the ambiguous/suspicious validation breaks
> all currently released clients.
>
>
> There are catalogs that do allow this via Spark SQL:
>   CREATE NAMESPACE my_catalog.`my/ugly\name%space.!`
>
>
> On 14.08.24 12:40, Eduard Tudenhöfner wrote:
>
>
> Keeping the Iceberg REST spec as is and configuring the (affected)
>> services that they accept the '%1F' separator character seems much easier
>> and less controversial and eliminates the need for these changes entirely.
>
>
> I do not think that this is a viable option for everyone. Such control
> characters are being rejected by the new Servlet spec for security reasons
> so not every REST server would want to open security holes by enabling
> something that the Servlet 6 spec is specifically trying to avoid.
> Here's some additional context on the topic of the *Servlet 6 rules on
> Ambiguous URIs *from the Jetty project
> 
> :
>
> Servlet 6 (Jetty 12 / ee10) clarified a ton of vague language and
>> behaviors around a bunch of core concepts.
>> Back in Servlet 5 (Jetty 11 or Jetty 12 / ee9) these were fundamentally
>> broken and led to a long array of security issues and broken libraries
>> that worked with Servlet 5.
>
>
>
> [1]
> https://jakarta.ee/specifications/servlet/6.0/jakarta-servlet-spec-6.0.html#uri-path-canonicalization
>


[VOTE] Make namespace separator configurable in REST Spec

2024-08-16 Thread Eduard Tudenhöfner
Hey everyone,

as I mentioned on the DISCUSS thread, this provides a simple path
forward for users of the V1 APIs (making the namespace separator
*configurable* instead of *hardcoded*) that are either running into issue
#10338 or will eventually run into it
when they upgrade their server stack.

The configuration part is *entirely optional* for REST server implementers
and there's no behavioral change for existing installations.

Please vote on the wording changes in the REST Spec in #10877
.

The vote will remain open for at least 72 hours.

Thanks
Eduard


Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread Eduard Tudenhöfner
If we really want to add more structure to the JSON representation, then I
would probably prefer what Dmitri suggested in the doc as I think { "GET": {
}, "POST": {} } looks a bit weird:

"endpoints":[
  {"verb": "GET","path": "/v1/{prefix}/namespaces/{namespace}"},
  {"verb": "GET","path": "/v1/{prefix}/namespaces"},
  {"verb": "POST","path": "/v1/{prefix}/namespaces"}
]

What do people think of that?



On Fri, Aug 16, 2024 at 8:13 AM Walaa Eldin Moustafa 
wrote:

> Thank you Eduard for sharing this version of the proposal. Looks simple,
> functional, and extensible.
>
> On Thu, Aug 15, 2024 at 1:10 PM Ryan Blue 
> wrote:
>
>> I think I'm fine either way. I lean toward the simplicity of the strings
>> in the proposal but would not complain if we went with Yufei's suggestion.
>>
>> On Thu, Aug 15, 2024 at 12:12 PM Yufei Gu  wrote:
>>
>>> The current proposal lists endpoints as plain strings, and I still
>>> believe we could make things a bit smoother by adding some structure to
>>> them. Here's the example if the previous one throws you off.
>>>
>>> *Before:*
>>>
>>> "GET /v1/{prefix}/namespaces","POST /v1/{prefix}/namespaces","GET 
>>> /v1/{prefix}/namespaces/{namespace}","DELETE 
>>> /v1/{prefix}/namespaces/{namespace}"
>>>
>>> *After:*
>>>
>>> {"/v1/{prefix}/namespaces": { "GET": {}, "POST": {} } 
>>> },{"/v1/{prefix}/namespaces/{namespace}": { "GET": {}, "DELETE": {} } }
>>>
>>> I think this approach would make the definitions more machine-readable
>>> and easier to expand in the future. It's always good to plan for potential
>>> changes, and this structure should help us adapt without much hassle.
>>> Plus, we could use a trimmed OpenAPI schema for this, so we wouldn’t
>>> need to create a new schema from scratch.
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Aug 15, 2024 at 11:02 AM Jack Ye  wrote:
>>>
 > But I propose to use a trimmed openAPI's format directly.
 Looking at the example, this feels quite complicated to me.

 > For example, it is easier if we want to include operationID
 I don't think we need to consider accommodating both, since operationId
 is an alternative to " ".

 > or adding feature flags, or adding parameters
 For more advanced feature flags, those could likely directly go to
 overrides just like today, but given we limited the scope to service
 endpoint discovery, I think that is another discussion later.

 So the current proposed solution from Eduard still feels better to me.
 And I think the argument for ambiguity is pretty strong so I am good with
 the proposed approach to use " ".

 -Jack





 On Thu, Aug 15, 2024 at 9:46 AM Yufei Gu  wrote:

> +1 for the proposal. In terms of the format, the current solution is
> simple enough. But I propose to use a trimmed openAPI's format directly. 
> It
> won't add much cost as we can just take the minimum fields we want. But it
> opens a window to extend it in the future. For example, it is easier if we
> want to include operationID, or adding feature flags, or adding 
> parameters.
> Here is an example:
>
> {
>   "resources": [
>     {
>       "/v1/{prefix}/namespaces": {
>         "GET": { "description": "List all namespaces" },
>         "POST": { "description": "Create a new namespace" }
>       }
>     },
>     { "path2": {} }
>     ...
>   ]
> }
>
>
> Yufei
>
>
> On Thu, Aug 15, 2024 at 8:47 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> I'm on board for this proposal. I was in the off-mail chats and I
>> think this is probably our simplest approach going forward.
>>
>> On Thu, Aug 15, 2024 at 10:39 AM Dmitri Bourlatchkov
>>  wrote:
>>
>>> OpenAPI tool will WARN a lot if Operation IDs overlap. Generated
>>> code/html may also look odd in case of overlaps.
>>>
>>> All-in-all, I think the best practice is to define unique Operation
>>> IDs up front.
>>>
>>> For Iceberg REST API, the yaml file is the API definition, so it
>>> should not be a problem to ensure that Operation IDs are unique, I 
>>> guess.
>>>
>>> Cheers,
>>> Dmitri.
>>>
>>> On Thu, Aug 15, 2024 at 11:32 AM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
 Hey Jack,

 thanks for the feedback. I replied in the doc but I can reiterate
 my answer here too: The *path* is unique and required so that
 feels more appropriate than requiring to have an optional
 *operationId* in the OpenAPI spec.
 Additionally, using the path is more straight-forward when we

Re: Iceberg-arrow vectorized read bug

2024-08-16 Thread Eduard Tudenhöfner
Hey Steve,

It's been a long time since I did some work on the iceberg-arrow module but
I will try to find some time next week to analyze the problem in detail and
see what options we have for fixing it.

Thanks for your patience here.

Eduard

On Mon, Aug 12, 2024 at 9:00 PM Lessard, Steve
 wrote:

> Iceberg-arrow is a non-trivial chunk of code that appears to me as though it 
> has no current owner. Based on the history I find in GitHub it appears the 
> original author of the org.apache.iceberg.arrow.vectorized package, Mayur 
> Srivastava, wrote this package as his one and only contribution to any public 
> repository in GitHub.com and hasn’t been seen since.  That was back in April 
> of 2021. Now I need to fix a bug in this package and cannot find anyone to mentor me 
> through changing this complex package.
>
>
>
> I’ve taken yet another look at this bug in the vectorized Arrow reader code, 
> issue #10275. As a summary 
> the issue is that vectorized reads fail when the parquet file does not 
> contain a column that exists within the Iceberg schema. This is a schema 
> evolution issue. I spent several hours studying the code, learning where and 
> how readers for each column are created, how the vector holder for each 
> column is produced, and how the value for each column is consumed via an 
> ArrowVectorAccessor instance. I have concluded that this isn’t really a bug 
> so much as a never implemented feature. There are two main reasons for my 
> conclusion:
>
> 1.  The code currently throws a NullPointerException when trying to read a 
> column that exists in the Iceberg schema but does not exist in the parquet 
> schema
>
> 2.  This comment in the source code for ArrowBatchReader.java line 53, “// Handle null vector for constant case.” I believe this is a TODO comment.
>
>
>
> So, what’s the path forward? At a minimum, and as a first step, I believe my 
> original PR for this issue should be approved and merged. That PR is 
> https://github.com/apache/iceberg/pull/10284. That PR simply prevents 
> throwing a NullPointerException and instead throws an 
> UnsupportedOperationException (which is the same exception thrown as the 
> default case for trying to read any unsupported column type.)
>
>
>
> As I wrote earlier that could be just a first step. What could be the next 
> step? Well, I drafted two hacks of the source code to help me explore and 
> hopefully inform a potential solution. I want to emphasize that the pull 
> requests are hacks and not intended nor proposed to be merged into Apache 
> Iceberg.
>
>
>
> First, I wrote this hack which explores adding a reader for the missing 
> column. https://github.com/apache/iceberg/pull/10922 Ultimately this solution 
> failed because, though I hacked together a reader, there was nothing to read 
> inside the parquet file. The meat of this change is in 
> arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java.
>
>
>
> Second, I wrote this hack which changes the dummy holder currently used for 
> null values. https://github.com/apache/iceberg/pull/10923 This hack 
> ultimately fails for two reasons: 1) The VectorHolder instance I created is 
> hacked together with hard coded information that just isn’t available at this 
> point in the call stack. 2) I don’t know how to make anything consume the 
> NullAccessor class I created to access the null value from the NullVector 
> instance.
>
>
>
> I am hoping two things will come out of this email message:
>
> 1.  My original pull request will be approved and merged. 
> https://github.com/apache/iceberg/pull/10284
>
> 2.  One or both of my hacks will inform / inspire someone on this list to 
> dream up a better solution and mentor me so that I may code up that solution 
> and further explore it.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From: *Lessard, Steve 
> *Date: *Thursday, August 1, 2024 at 4:38 PM
> *To: *dev@iceberg.apache.org , Amogh Jahagirdar <
> 2am...@gmail.com>
> *Cc: *dev@iceberg.apache.org 
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Hi Amogh, Do you think you could have another look at this issue or point
> me to someone who might be able to help me identify to the root cause and
> the correct fix?
>
>
>
>
>
>
>
> *From: *Lessard, Steve 
> *Date: *Monday, July 29, 2024 at 5:46 PM
> *To: *dev@iceberg.apache.org , Amogh Jahagirdar <
> 2am...@gmail.com>
> *Cc: *dev@iceberg.apache.org 
> *Subject: *Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
>
> Adding Amogh Jahagirdar to the To: line…
>
>
>
>
>
> *From: *Lessard, Steve 
> *Date: *Monday, July 29, 2024 at 1:12 PM
> *To: *dev@iceberg.apache.org ,
> steve.less...@terada

Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread Dmitri Bourlatchkov
Sorry for a bit of back-tracking, but I'd like to clarify the meaning of
those endpoint strings.

I initially assumed that the strings would need to be parsed into
components (verb / path) for use in runtime. My suggestion for using a JSON
representation was meant to make the parsing more standard (i.e. avoid
custom format).

After reading the impl. PR [1] I think parsing is actually _not_ necessary.
The whole endpoint string is basically another form of an endpoint "ID".

If we agree that the values in the new endpoints config list should be
treated as opaque IDs, I think it should be fine to use the simple string
representation.
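
As a quick sketch of that "opaque ID" reading (illustrative only, not the
client implementation), the consumer never splits the string into verb and
path, it only tests membership:

import java.util.Set;

class EndpointCheck {
  public static void main(String[] args) {
    Set<String> advertised = Set.of(
        "GET /v1/{prefix}/namespaces",
        "POST /v1/{prefix}/namespaces");
    // The string is compared as-is, exactly like an operation ID would be.
    System.out.println(advertised.contains("POST /v1/{prefix}/namespaces")); // true
  }
}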

WDYT?

Thanks,
Dmitri.

[1] https://github.com/apache/iceberg/pull/10929

On Fri, Aug 16, 2024 at 4:43 AM Eduard Tudenhöfner 
wrote:

> If we really want to add more structure to the JSON representation, then I
> would probably prefer what Dmitri suggested in the doc as I think { "GET":
> {}, "POST": {} } looks a bit weird:
>
> "endpoints":[
>   {"verb": "GET","path": "/v1/{prefix}/namespaces/{namespace}"},
>   {"verb": "GET","path": "/v1/{prefix}/namespaces"},
>   {"verb": "POST","path": "/v1/{prefix}/namespaces"}
> ]
>
> What do people think of that?
>
>
>
> On Fri, Aug 16, 2024 at 8:13 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Thank you Eduard for sharing this version of the proposal. Looks simple,
>> functional, and extensible.
>>
>> On Thu, Aug 15, 2024 at 1:10 PM Ryan Blue 
>> wrote:
>>
>>> I think I'm fine either way. I lean toward the simplicity of the strings
>>> in the proposal but would not complain if we went with Yufei's suggestion.
>>>
>>> On Thu, Aug 15, 2024 at 12:12 PM Yufei Gu  wrote:
>>>
 The current proposal lists endpoints as plain strings, and I still
 believe we could make things a bit smoother by adding some structure to
 them. Here's the example if the previous one throws you off.

 *Before:*

 "GET /v1/{prefix}/namespaces","POST /v1/{prefix}/namespaces","GET 
 /v1/{prefix}/namespaces/{namespace}","DELETE 
 /v1/{prefix}/namespaces/{namespace}"

 *After:*

 {"/v1/{prefix}/namespaces": { "GET": {}, "POST": {} } 
 },{"/v1/{prefix}/namespaces/{namespace}": { "GET": {}, "DELETE": {} } }

 I think this approach would make the definitions more machine-readable
 and easier to expand in the future. It's always good to plan for potential
 changes, and this structure should help us adapt without much hassle.
 Plus, we could use a trimmed OpenAPI schema for this, so we wouldn’t
 need to create a new schema from scratch.

 Yufei


 On Thu, Aug 15, 2024 at 11:02 AM Jack Ye  wrote:

> > But I propose to use a trimmed openAPI's format directly.
> Looking at the example, this feels quite complicated to me.
>
> > For example, it is easier if we want to include operationID
> I don't think we need to consider accommodating both, since
> operationId is an alternative to "  spec>".
>
> > or adding feature flags, or adding parameters
> For more advanced feature flags, those could likely directly go to
> overrides just like today, but given we limited the scope to service
> endpoint discovery, I think that is another discussion later.
>
> So the current proposed solution from Eduard still feels better to me.
> And I think the argument for ambiguity is pretty strong so I am good with
> the proposed approach to use " ".
>
> -Jack
>
>
>
>
>
> On Thu, Aug 15, 2024 at 9:46 AM Yufei Gu  wrote:
>
>> +1 for the proposal. In terms of the format, the current solution is
>> simple enough. But I propose to use a trimmed openAPI's format directly. 
>> It
>> won't add much cost as we can just take the minimum fields we want. But 
>> it
>> opens a window to extend it in the future. For example, it is easier if 
>> we
>> want to include operationID, or adding feature flags, or adding 
>> parameters.
>> Here is an example:
>>
>> {
>>   "resources": [
>>     {
>>       "/v1/{prefix}/namespaces": {
>>         "GET": { "description": "List all namespaces" },
>>         "POST": { "description": "Create a new namespace" }
>>       }
>>     },
>>     { "path2": {} }
>>     ...
>>   ]
>> }
>>
>>
>> Yufei
>>
>>
>> On Thu, Aug 15, 2024 at 8:47 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I'm on board for this proposal. I was in the off-mail chats and I
>>> think this is probably our simplest approach going forward.
>>>
>>> On Thu, Aug 15, 2024 at

Re: [VOTE] Make namespace separator configurable in REST Spec

2024-08-16 Thread Dmitri Bourlatchkov
+1 (nb) to the spec change.

Cheers,
Dmitri.

On Fri, Aug 16, 2024 at 4:31 AM Eduard Tudenhöfner 
wrote:

> Hey everyone,
>
> as I mentioned on the DISCUSS thread, this is providing a simple path
> forward for users of the V1 APIs (make the namespace separator
> *configurable* instead of *hardcoded*) that are either running into issue
> #10338  or will
> eventually when they upgrade their server stack.
>
> The configuration part is *entirely optional* for REST server
> implementers and there's no behavioral change for existing installations.
>
> Please vote on the wording changes in the REST Spec in #10877
> .
>
> The vote will remain open for at least 72 hours.
>
> Thanks
> Eduard
>
>


Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread karuppayya
Can we consider adding the `OPTIONS` verb to the resource paths as part of the
spec? That way the endpoint discovery endpoint could return only the list of
supported endpoints, without the verbs.

`OPTIONS` on a resource path can return the list of all supported verbs,
as well as other information regarding API usage.
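
For reference, a plain HTTP exchange of the kind described above might look
like this (the verbs in the Allow header are illustrative only, not taken
from the spec):

OPTIONS /v1/{prefix}/namespaces/{namespace} HTTP/1.1

HTTP/1.1 204 No Content
Allow: GET, HEAD, DELETE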

- Karuppayya

On Fri, Aug 16, 2024 at 7:34 AM Dmitri Bourlatchkov
 wrote:

> Sorry for a bit of back-tracking, but I'd like to clarify the meaning of
> those endpoint strings.
>
> I initially assumed that the strings would need to be parsed into
> components (verb / path) for use in runtime. My suggestion for using a JSON
> representation was meant to make the parsing more standard (i.e. avoid
> custom format).
>
> After reading the impl. PR [1] I think parsing is actually _not_
> necessary. The whole endpoint string is basically another form of an
> endpoint "ID".
>
> If we agree that the values in the new endpoints config list should be
> treated as opaque IDs, I think it should be fine to use the simple string
> representation.
>
> WDYT?
>
> Thanks,
> Dmitri.
>
> [1] https://github.com/apache/iceberg/pull/10929
>
> On Fri, Aug 16, 2024 at 4:43 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> If we really want to add more structure to the JSON representation, then
>> I would probably prefer what Dmitri suggested in the doc as I think {
>> "GET": {}, "POST": {} } looks a bit weird:
>>
>> "endpoints":[
>>   {"verb": "GET","path": "/v1/{prefix}/namespaces/{namespace}"},
>>   {"verb": "GET","path": "/v1/{prefix}/namespaces"},
>>   {"verb": "POST","path": "/v1/{prefix}/namespaces"}
>> ]
>>
>> What do people think of that?
>>
>>
>>
>> On Fri, Aug 16, 2024 at 8:13 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thank you Eduard for sharing this version of the proposal. Looks simple,
>>> functional, and extensible.
>>>
>>> On Thu, Aug 15, 2024 at 1:10 PM Ryan Blue 
>>> wrote:
>>>
 I think I'm fine either way. I lean toward the simplicity of the
 strings in the proposal but would not complain if we went with Yufei's
 suggestion.

 On Thu, Aug 15, 2024 at 12:12 PM Yufei Gu  wrote:

> The current proposal lists endpoints as plain strings, and I still
> believe we could make things a bit smoother by adding some structure to
> them. Here's the example if the previous one throws you off.
>
> *Before:*
>
> "GET /v1/{prefix}/namespaces","POST /v1/{prefix}/namespaces","GET 
> /v1/{prefix}/namespaces/{namespace}","DELETE 
> /v1/{prefix}/namespaces/{namespace}"
>
> *After:*
>
> {"/v1/{prefix}/namespaces": { "GET": {}, "POST": {} } 
> },{"/v1/{prefix}/namespaces/{namespace}": { "GET": {}, "DELETE": {} } }
>
> I think this approach would make the definitions more machine-readable
> and easier to expand in the future. It's always good to plan for potential
> changes, and this structure should help us adapt without much hassle.
> Plus, we could use a trimmed OpenAPI schema for this, so we wouldn’t
> need to create a new schema from scratch.
>
> Yufei
>
>
> On Thu, Aug 15, 2024 at 11:02 AM Jack Ye  wrote:
>
>> > But I propose to use a trimmed openAPI's format directly.
>> Looking at the example, this feels quite complicated to me.
>>
>> > For example, it is easier if we want to include operationID
>> I don't think we need to consider accommodating both, since
>> operationId is an alternative to " > spec>".
>>
>> > or adding feature flags, or adding parameters
>> For more advanced feature flags, those could likely directly go to
>> overrides just like today, but given we limited the scope to service
>> endpoint discovery, I think that is another discussion later.
>>
>> So the current proposed solution from Eduard still feels better to
>> me. And I think the argument for ambiguity is pretty strong so I am good
>> with the proposed approach to use " > spec>".
>>
>> -Jack
>>
>>
>>
>>
>>
>> On Thu, Aug 15, 2024 at 9:46 AM Yufei Gu 
>> wrote:
>>
>>> +1 for the proposal. In terms of the format, the current solution is
>>> simple enough. But I propose to use a trimmed openAPI's format 
>>> directly. It
>>> won't add much cost as we can just take the minimum fields we want. But 
>>> it
>>> opens a window to extend it in the future. For example, it is easier if 
>>> we
>>> want to include operationID, or adding feature flags, or adding 
>>> parameters.
>>> Here is an example:
>>>
>>> {
>>>
>>>   "resources": [
>>>
>>> {
>>>
>>>   "/v1/{prefix}/namespaces":
>>>
>>> {
>>>
>>>   "GET": {
>>>
>>>   "description": "List all namespaces"
>>>
>>>   }
>>>
>>> },
>>>
>>> {
>>>
>>>   "

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Jan for the summary.

For this point:

> For a refresh operation the query engine has to parse the SQL and fully
expand the lineage with its children anyway. So the lineage is not
strictly required.

If the lineage is provided at creation time by the respective engine, the
refresh operation does not need to parse the SQL, correct?

Thanks,
Walaa.

On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
wrote:

> As the table I created is not properly shown in the mailing list I'll
> reformat the summary of the different drawbacks again:
>
> Drawbacks of (no lineage, refresh-state key = identifier):
>
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> Drawbacks of (normalized lineage, refresh-state key = uuid):
>
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>
> - update of materialized view version required if child view is updated
> (#5)
> On 16.08.24 09:17, Jan Kaul wrote:
>
> Hi,
>
> Thanks Micah for clearly stating the requirements. I think this gives
> better clarity for the discussion.
>
> It seems like we don't have a solution that satisfies all requirements at
> once. So we would need to choose which has the fewest drawbacks.
>
> I would like to summarize the different drawbacks that came up in the
> discussion.
> Drawbacks of (no lineage, refresh-state key = identifier):
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> Drawbacks of (normalized lineage, refresh-state key = uuid):
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> Drawbacks of (denormalized lineage, refresh-state key = uuid):
> - update of materialized view version required if child view is updated (#5)
>
> With identifiers as the refresh-state keys, the lineage is not strictly
> required and becomes an orthogonal proposal. That's why I left it out of
> the comparison.
>
> In my opinion introducing catalog identifiers into the table metadata
> (requirement #4) is the least significant drawback as it is not a technical
> reason but more about semantics. Especially as the identifiers are not
> introduced into the table spec but are rather stored in the snapshot
> summary. That's why I'm in favor of using the catalog identifiers as the
> refresh-state keys.
>
> Regarding your last point Walaa:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree.
>
> For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with its children anyway. So the lineage is not
> strictly required.
>
> If I understand correctly, most of you are also in favor of using catalog
> identifiers + ref as the refresh-state keys and postponing the lineage
> proposal.
>
> I hope that we can move the discussion forward.
>
> Jan
> On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree. Therefore,
> using catalog identifiers in the state map does not eliminate the need for
> tracking children in the form of catalog identifiers on the lineage side
> (but in this case lineage will be a set instead of just a map).
>
> Hence, my concerns with using catalog identifiers (as opposed to UUIDs)
> are:
> * The fundamental issue where the table spec depends on/refers to the view
> spec (because such catalog identifiers are not defined in the table spec
> and the only place they have a meaning is in the view spec lineage
> information).
> * (less fundamental) The denormalization introduced by this arrangement,
> where each identifier is 3-parts and all of them repeat in both lineage
> info and state map.
>
> I am not very concerned with recursive expansion (through multiple calls),
> as it is always the case with views.
>
> On a positive note, looks like we agree to move past sequence numbers :)
>
> Thanks,
> Walaa.
>
>
>
>
> On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
> wrote:
>
>> I think given the constraint that catalog lookup has to be by identifier
>>> and not UUID, I'd prefer using identifier in the refresh state.  If we use
>>> identifiers, we can directly parallelize the catalog calls to fetch the
>>> latest state.  If we use UUID, the engine has to go back to the MV and
>>> possibly additional views to reconstruct the lineage map.  It's just a lot
>>> slower and more work fo

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Daniel Weeks
Thanks Sung!

I agree with the comments that this doesn't require a new RC.

+1 (binding)

Verified sigs/sums/license/build/test with Python 3.11.9

Thanks,
-Dan

On Thu, Aug 15, 2024 at 3:34 PM Sung Yun  wrote:

> Hi Daniel, thank you very much for testing the installation thoroughly and
> reporting these issues.
>
> We make note of the supported Python versions using the PyPI classifiers,
> but I agree that we should include the restriction in the installation
> requirements as well.
>
> Here's a PR with an attempt to fix this issue, and the issue with
> docutils (0.21.post1): https://github.com/apache/iceberg-python/pull/1067
>
> On Thu, Aug 15, 2024 at 5:35 PM Daniel Weeks  wrote:
>
>> I ran into a couple issues while trying to verify the release.
>>
>> The first appears to be a transient issue (we ran into something similar
>> in the 0.6.1 release but I was able to install later).
>>
>> Package docutils (0.21.post1) not found.
>> make: *** [install-dependencies] Error 1
>>
>> The second issue is more concerning to me because I can't install
>> dependencies with Python 3.12.4 as I get the following:
>>
>>   - Installing numpy (1.24.4): Failed
>>
>>   ChefBuildError
>>
>>   Backend 'setuptools.build_meta:__legacy__' is not available.
>>
>>   Cannot import 'setuptools.build_meta'
>>
>>   at venv/lib/python3.12/site-packages/poetry/installation/chef.py:164 in
>> _prepare
>>   160│
>>   161│ error =
>> ChefBuildError("\n\n".join(message_parts))
>>   162│
>>   163│ if error is not None:
>> → 164│ raise error from None
>>   165│
>>   166│ return path
>>   167│
>>   168│ def _prepare_sdist(self, archive: Path, destination: Path
>> | None = None) -> Path:
>>
>> Note: This error originates from the build backend, and is likely not a
>> problem with poetry but with numpy (1.24.4) not supporting PEP 517 builds.
>> You can verify this by running 'pip wheel --no-cache-dir --use-pep517
>> "numpy (==1.24.4)"'.
>>
>> I was able to verify everything with 3.11 however, but I haven't seen
>> anything that would indicate we don't support 3.12.x
>>
>> -Dan
>>
>> On Wed, Aug 14, 2024 at 7:14 PM Kevin Liu  wrote:
>>
>>> +1 (non-binding)
>>> Verified signatures/checksums/licenses. Ran unit and integration tests.
>>>
>>> On Thu, Aug 15, 2024 at 2:42 AM Fokko Driesprong 
>>> wrote:
>>>
 +1 (binding)

 Thanks Sung for running this 🙌

 - Validated signatures/checksums/license
 - Ran some basic tests (3.10)

 Kind regards,
 Fokko

 Op wo 14 aug 2024 om 19:57 schreef André Luis Anastácio
 :

>
>- validated signatures and checksums
>
>
>- checked license
>
>
>- ran tests and test-coverage with Python 3.9.12
>
>
> +1 (non-binding)
>
> André Anastácio
>
> On Tuesday, August 13th, 2024 at 10:19 PM, Sung Yun <
> sungwy...@gmail.com> wrote:
>
> Hi Everyone,
>
> I propose that we release the following RC as the official PyIceberg
> 0.7.1 release.
>
> A summary of the high level features:
>
> * Fix `delete` to trace existing manifests when a data file is
> partially rewritten
> 
> * Fix 'to_arrow_batch_reader' to respect the limit input arg
> 
> * Fix correctness of applying positional deletes on Merge-On-Read
> tables 
> * Fix overwrite when filtering data
> 
> * Bug fix for deletes across multiple partition specs on partition
> evolution 
> * Fix evolving the table and writing in the same transaction
> 
> * Fix scans when result is empty
> 
> * Fix ListNamespace response in REST Catalog
> 
> * Exclude Python 3.9.7 from list of supported versions
> 
> * Allow setting write.parquet.row-group-limit
> 
> * Allow setting write.parquet.page-row-limit
> 
> 
> * Fix pydantic warning during commit
> 
>
> The commit ID is f92994e85e526502a620506b964665b9afd385fe
>
> * This corresponds to the tag: pyiceberg-0.7.1rc2
> (d33192a3f64e1b5840c691b24a6071768a9fc79b)
> *
> https://github.com/apache/ice

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Chinmay Bhat
+1 (non-binding)

- Verified signatures, checksums, license
- Ran unit tests & test-coverage with Python 3.9.19

Best,
Chinmay

On Fri, Aug 16, 2024 at 10:02 PM Daniel Weeks  wrote:

> Thanks Sung!
>
> I agree with the comments that this doesn't require a new RC.
>
> +1 (binding)
>
> Verified sigs/sums/license/build/test with Python 3.11.9
>
> Thanks,
> -Dan
>
> On Thu, Aug 15, 2024 at 3:34 PM Sung Yun  wrote:
>
>> Hi Daniel, thank you very much for testing the installation thoroughly
>> and reporting these issues.
>>
>> We make note of the supported Python versions using the PyPI classifiers,
>> but I agree that we should include the restriction in the installation
>> requirements as well.
>>
>> Here's a PR with an attempt to fix this issue, and the issue with
>> docutils (0.21.post1): https://github.com/apache/iceberg-python/pull/1067
>>
>> On Thu, Aug 15, 2024 at 5:35 PM Daniel Weeks  wrote:
>>
>>> I ran into a couple issues while trying to verify the release.
>>>
>>> The first appears to be a transient issue (we ran into something similar
>>> in the 0.6.1 release but I was able to install later).
>>>
>>> Package docutils (0.21.post1) not found.
>>> make: *** [install-dependencies] Error 1
>>>
>>> The second issue is more concerning to me because I can't install
>>> dependencies with Python 3.12.4 as I get the following:
>>>
>>>   - Installing numpy (1.24.4): Failed
>>>
>>>   ChefBuildError
>>>
>>>   Backend 'setuptools.build_meta:__legacy__' is not available.
>>>
>>>   Cannot import 'setuptools.build_meta'
>>>
>>>   at venv/lib/python3.12/site-packages/poetry/installation/chef.py:164
>>> in _prepare
>>>   160│
>>>   161│ error =
>>> ChefBuildError("\n\n".join(message_parts))
>>>   162│
>>>   163│ if error is not None:
>>> → 164│ raise error from None
>>>   165│
>>>   166│ return path
>>>   167│
>>>   168│ def _prepare_sdist(self, archive: Path, destination: Path
>>> | None = None) -> Path:
>>>
>>> Note: This error originates from the build backend, and is likely not a
>>> problem with poetry but with numpy (1.24.4) not supporting PEP 517 builds.
>>> You can verify this by running 'pip wheel --no-cache-dir --use-pep517
>>> "numpy (==1.24.4)"'.
>>>
>>> I was able to verify everything with 3.11 however, but I haven't seen
>>> anything that would indicate we don't support 3.12.x
>>>
>>> -Dan
>>>
>>> On Wed, Aug 14, 2024 at 7:14 PM Kevin Liu 
>>> wrote:
>>>
 +1 (non-binding)
 Verified signatures/checksums/licenses. Ran unit and integration tests.

 On Thu, Aug 15, 2024 at 2:42 AM Fokko Driesprong 
 wrote:

> +1 (binding)
>
> Thanks Sung for running this 🙌
>
> - Validated signatures/checksums/license
> - Ran some basic tests (3.10)
>
> Kind regards,
> Fokko
>
> Op wo 14 aug 2024 om 19:57 schreef André Luis Anastácio
> :
>
>>
>>- validated signatures and checksums
>>
>>
>>- checked license
>>
>>
>>- ran tests and test-coverage with Python 3.9.12
>>
>>
>> +1 (non-binding)
>>
>> André Anastácio
>>
>> On Tuesday, August 13th, 2024 at 10:19 PM, Sung Yun <
>> sungwy...@gmail.com> wrote:
>>
>> Hi Everyone,
>>
>> I propose that we release the following RC as the official PyIceberg
>> 0.7.1 release.
>>
>> A summary of the high level features:
>>
>> * Fix `delete` to trace existing manifests when a data file is
>> partially rewritten
>> 
>> * Fix 'to_arrow_batch_reader' to respect the limit input arg
>> 
>> * Fix correctness of applying positional deletes on Merge-On-Read
>> tables 
>> * Fix overwrite when filtering data
>> 
>> * Bug fix for deletes across multiple partition specs on partition
>> evolution 
>> * Fix evolving the table and writing in the same transaction
>> 
>> * Fix scans when result is empty
>> 
>> * Fix ListNamespace response in REST Catalog
>> 
>> * Exclude Python 3.9.7 from list of supported versions
>> 
>> * Allow setting write.parquet.row-group-limit
>> 
>> * Allow setting write.parquet.page-row-limit
>> 
>> 

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Gene Pang
Hi all,

I am one of the main developers for Variant in Apache Spark. David Cashman
(another one of the main Variant developers) and I have been working on
Variant in Spark for a while, and we are excited by the interest from the
Iceberg community!

We have attended some of the Iceberg dev Variant meetings, with Russell and
several others here, and it has been a great experience in sharing ideas
and getting feedback, especially for the shredding scheme. Maybe there was
some sort of miscommunication, but in those meetings we mentioned that we
wanted to move the specification and library out to some other location,
where it would be easier for other projects to use and collaborate on. I
don't know how it ended up that we are not interested in moving them, but I
want to correct that misconception, and ensure that we continue to have the
goal to host the spec and implementation elsewhere. I haven't heard
anything since those meetings, but we have also been trying to figure out
some candidate homes.

There have been good suggestions and ideas in this thread, and they will be
considered as we also coordinate with the Spark and Delta communities,
since those projects are already depending on the existing library. I
appreciate all the ideas for a potential new home for the project!

Thanks,
Gene

On Thu, Aug 15, 2024 at 4:18 PM Micah Kornfield 
wrote:

> Thats fair @Micah, so far all the discussions have been direct and off the
>> dev list. Would you like to make the request on the public Spark Dev list?
>> I would be glad to co-sign, I can also draft up a quick email if you don't
>> have time.
>
>
> I think once we come to consensus, if you have bandwidth, I think the
> message might be better coming from you, as you have more context on some
> of the non-public conversations, the requirements from an Iceberg
> perspective on governance and the blockers that were encountered.  If
> details on the conversations can't be shared, (i.e. we are starting from
> scratch) it seems like suggesting a new project via SPIP might be the way
> forward.  I'm happy to help with that if it is useful but I would guess
> Aihua or Tyler might be in a better place to start as it seems they have
> done more serious thinking here.
>
> If we decide to try to standardize on Parquet or Arrow I'm happy to help
> support the effort in those communities.
>
> Thanks,
> Micah
>
> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer 
> wrote:
>
>> Thats fair @Micah, so far all the discussions have been direct and off
>> the dev list. Would you like to make the request on the public Spark Dev
>> list? I would be glad to co-sign, I can also draft up a quick email if you
>> don't have time.
>>
>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield 
>> wrote:
>>
>>> I agree that it would be beneficial to make a sub-project; the main
 problem is political, not logistical. I've been asking for movement from
 other related projects for a month and we simply haven't gotten anywhere.
>>>
>>>
>>> I just wanted to double check that these issues were brought directly to
>>> the spark community (i.e. a discussion thread on the Spark developer
>>> mailing list) and not via backchannels.
>>>
>>> I'm not sure the outcome would be different and I don't think this
>>> should block forking the spec, but we should make sure that the decision is
>>> publicly documented within both communities.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 @Gang Wu

 I agree that it would be beneficial to make a sub-project; the main
 problem is political, not logistical. I've been asking for movement from
 other related projects for a month and we simply haven't gotten anywhere.
 I don't think there is anything that would stop us from moving to a joint
 project in the future and if you know of some way of encouraging that
 movement from other relevant parties I would be glad to collaborate in
 doing that. One thing that I don't want to do is have the Iceberg project
 stay in a holding pattern without any clear roadmap as to how to proceed.

 On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu  wrote:

> I’m on board with copying the spec into our repository. However, as
> we’ve talked about, it’s not just a straightforward copy—there are already
> some divergences. Some of them are under discussion. Iceberg is definitely
> the best place for these specs. Engines like Trino and Flink can then rely
> on the Iceberg specs as a solid foundation.
>
> Yufei
>
> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu  wrote:
>
>> Sorry for chiming in late.
>>
>> From the discussion in
>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>> don't quite understand why it is logistically complicated to create a
>> sub-project to hold the variant spec and impl.
>>
>> IMHO, copying the variant typ

Re: [DISCUSS] REST Endpoint discovery

2024-08-16 Thread Yufei Gu
I’m OK with using a plain string for the endpoint ID, as described in
doc[1]. However, I’ve been thinking about how we can make this more
flexible, especially since we’ve had quite a few discussions about
granularity.

For instance, if we expose a bit more of the API specification, we might
not need some of the feature flags.

It might be worthwhile to consider adopting the OpenAPI path object
schema[2] if we decide to add more structure. Since Iceberg is already
using OpenAPI to define the REST specification and will continue to do so,
this approach could make future extensions easier.
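
For example, a trimmed path-object entry might look like the following
(field selection and the operation IDs shown are illustrative, not a
concrete proposal):

"/v1/{prefix}/namespaces": {
  "get":  { "operationId": "listNamespaces" },
  "post": { "operationId": "createNamespace" }
}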

Looking forward to your thoughts on this!

Ref:

   1. API Endpoints
   
   2. Path Object 

Yufei


On Fri, Aug 16, 2024 at 8:58 AM karuppayya  wrote:

> Can we consider adding `OPTIONS` verb to the resource paths, as part of
> the spec?
> That way the endpoint discovery endpoint could return only the list of
> supported endpoints, without the verbs.
>
> `OPTIONS` on the resource path can return the list of all supported verbs,
> and also other information regarding API usage
>
> - Karuppayya
>
> On Fri, Aug 16, 2024 at 7:34 AM Dmitri Bourlatchkov
>  wrote:
>
>> Sorry for a bit of back-tracking, but I'd like to clarify the meaning of
>> those endpoint strings.
>>
>> I initially assumed that the strings would need to be parsed into
>> components (verb / path) for use in runtime. My suggestion for using a JSON
>> representation was meant to make the parsing more standard (i.e. avoid
>> custom format).
>>
>> After reading the impl. PR [1] I think parsing is actually _not_
>> necessary. The whole endpoint string is basically another form of an
>> endpoint "ID".
>>
>> If we agree that the values in the new endpoints config list should be
>> treated as opaque IDs, I think it should be fine to use the simple string
>> representation.
>>
>> WDYT?
>>
>> Thanks,
>> Dmitri.
>>
>> [1] https://github.com/apache/iceberg/pull/10929
>>
>> On Fri, Aug 16, 2024 at 4:43 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> If we really want to add more structure to the JSON representation, then
>>> I would probably prefer what Dmitri suggested in the doc as I think {
>>> "GET": {}, "POST": {} } looks a bit weird:
>>>
>>> "endpoints":[
>>>   {"verb": "GET","path": "/v1/{prefix}/namespaces/{namespace}"},
>>>   {"verb": "GET","path": "/v1/{prefix}/namespaces"},
>>>   {"verb": "POST","path": "/v1/{prefix}/namespaces"}
>>> ]
>>>
>>> What do people think of that?
>>>
>>>
>>>
>>> On Fri, Aug 16, 2024 at 8:13 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
 Thank you Eduard for sharing this version of the proposal. Looks
 simple, functional, and extensible.

 On Thu, Aug 15, 2024 at 1:10 PM Ryan Blue 
 wrote:

> I think I'm fine either way. I lean toward the simplicity of the
> strings in the proposal but would not complain if we went with Yufei's
> suggestion.
>
> On Thu, Aug 15, 2024 at 12:12 PM Yufei Gu 
> wrote:
>
>> The current proposal lists endpoints as plain strings, and I still
>> believe we could make things a bit smoother by adding some structure to
>> them. Here's the example if the previous one throws you off.
>>
>> *Before:*
>>
>> "GET /v1/{prefix}/namespaces","POST /v1/{prefix}/namespaces","GET 
>> /v1/{prefix}/namespaces/{namespace}","DELETE 
>> /v1/{prefix}/namespaces/{namespace}"
>>
>> *After:*
>>
>> {"/v1/{prefix}/namespaces": { "GET": {}, "POST": {} } 
>> },{"/v1/{prefix}/namespaces/{namespace}": { "GET": {}, "DELETE": {} } }
>>
>> I think this approach would make the definitions more
>> machine-readable and easier to expand in the future. It's always good to
>> plan for potential changes, and this structure should help us adapt 
>> without
>> much hassle.
>> Plus, we could use a trimmed OpenAPI schema for this, so we wouldn’t
>> need to create a new schema from scratch.
>>
>> Yufei
>>
>>
>> On Thu, Aug 15, 2024 at 11:02 AM Jack Ye  wrote:
>>
>>> > But I propose to use a trimmed openAPI's format directly.
>>> Looking at the example, this feels quite complicated to me.
>>>
>>> > For example, it is easier if we want to include operationID
>>> I don't think we need to consider accommodating both, since
>>> operationId is an alternative to " >> spec>".
>>>
>>> > or adding feature flags, or adding parameters
>>> For more advanced feature flags, those could likely directly go to
>>> overrides just like today, but given we limited the scope to service
>>> endpoint discovery, I think that is another discussion later.
>>>
>>> So the current proposed solution from Eduard still feels better to
>>> me. And I think the argument for ambiguity is pretty strong so I am good
>>

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Weston Pace
+1 to using Arrow to house the spec.  In the interest of expediency I
wonder if we could even store it there "on the side" while we figure out
how to integrate the variant data type with Arrow.

I have a question for those more familiar with the variant spec.  Do we
think it could be introduced as a canonical extension type?  My thinking
right now is "no" because there is no storage type that has the concept of
a "metadata buffer" unless you count dictionary arrays but the metadata
buffer in dictionary arrays already has a specific meaning and so
attempting to smuggle variant metadata into it would be odd.  You could
also use a "variable length binary" with one extra row but that doesn't
work because then you end up with more rows.  You could very awkwardly
create a variable length binary array where the first item in the array is
the metadata buffer + the first variant value but that seems like
stretching things.

If it's not an extension array then it will need to be a new first-class
data type / layout and that will take some work & time to achieve support.
However, I think that may be a prerequisite in some ways anyway.  I believe
(at least some) Iceberg implementations are using Arrow internally (I could
be mistaken)?  If that is so then it seems like implementing the variant
type into Arrow is an inevitability.
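To make the layout question concrete, here is a rough pyarrow sketch (the byte values are placeholders, not the real Variant encoding) of the shape under discussion: one metadata buffer shared by the whole column plus one encoded value per row, which is the part that has no direct counterpart among existing Arrow storage types:

import pyarrow as pa

# Placeholder bytes, not the actual Variant encoding.
shared_metadata = b"\x01metadata-dictionary"         # one buffer for the whole column
values = pa.array([b"\x0cvalue-1", b"\x0cvalue-2"],  # one encoded value per row
                  type=pa.binary())

# Existing Arrow layouts can represent the per-row value buffer, but none of
# them also carry a per-array "metadata buffer" alongside it, which is why a
# canonical extension type over current storage types is awkward.
print(len(shared_metadata), values)
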

On Fri, Aug 16, 2024 at 9:48 AM Gene Pang  wrote:

> Hi all,
>
> I am one of the main developers for Variant in Apache Spark. David Cashman
> (another one of the main Variant developers) and I have been working on
> Variant in Spark for a while, and we are excited by the interest from the
> Iceberg community!
>
> We have attended some of the Iceberg dev Variant meetings, with Russell
> and several others here, and it has been a great experience in sharing
> ideas and getting feedback, especially for the shredding scheme. Maybe
> there was some sort of miscommunication, but in those meetings we mentioned
> that we wanted to move the specification and library out to some other
> location, where it would be easier for other projects to use and
> collaborate on. I don't know how it ended up that we are not interested in
> moving them, but I want to correct that misconception, and ensure that we
> continue to have the goal to host the spec and implementation elsewhere. I
> haven't heard anything since those meetings, but we have also been trying
> to figure out some candidate homes.
>
> There have been good suggestions and ideas in this thread, and they will
> be considered as we also coordinate with the Spark and Delta communities,
> since those projects are already depending on the existing library. I
> appreciate all the ideas for a potential new home for the project!
>
> Thanks,
> Gene
>
> On Thu, Aug 15, 2024 at 4:18 PM Micah Kornfield 
> wrote:
>
>> Thats fair @Micah, so far all the discussions have been direct and off
>>> the dev list. Would you like to make the request on the public Spark Dev
>>> list? I would be glad to co-sign, I can also draft up a quick email if you
>>> don't have time.
>>
>>
>> I think once we come to consensus, if you have bandwidth, I think the
>> message might be better coming from you, as you have more context on some
>> of the non-public conversations, the requirements from an Iceberg
>> perspective on governance and the blockers that were encountered.  If
>> details on the conversations can't be shared, (i.e. we are starting from
>> scratch) it seems like suggesting a new project via SPIP might be the way
>> forward.  I'm happy to help with that if it is useful but I would guess
>> Aihua or Tyler might be in a better place to start as it seems they have
>> done more serious thinking here.
>>
>> If we decide to try to standardize on Parquet or Arrow I'm happy to help
>> support the effort in those communities.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Thats fair @Micah, so far all the discussions have been direct and off
>>> the dev list. Would you like to make the request on the public Spark Dev
>>> list? I would be glad to co-sign, I can also draft up a quick email if you
>>> don't have time.
>>>
>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield 
>>> wrote:
>>>
 I agree that it would be beneficial to make a sub-project, the main
> problem is political and not logistic. I've been asking for movement from
> other relative projects for a month and we simply haven't gotten anywhere.


 I just wanted to double check that these issues were brought directly
 to the spark community (i.e. a discussion thread on the Spark developer
 mailing list) and not via backchannels.

 I'm not sure the outcome would be different and I don't think this
 should block forking the spec, but we should make sure that the decision is
 publicly documented within both communities.

 Thanks,
 Micah

 On Thu, Aug 15, 2024 at 7:47 AM Russell Spi

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
Hi Walaa,

I would argue that for the refresh operation the query engine has to parse the query and then somehow execute it. For a full refresh it will directly execute the query and for an incremental refresh it will execute a modified version. Therefore it has to fully expand the query tree.

Best wishes,

Jan

On 16.08.2024 18:13, Walaa Eldin Moustafa wrote:

Thanks Jan for the summary.

For this point:

> For a refresh operation the query engine has to parse the SQL and fully expand the lineage with its children anyway. So the lineage is not strictly required.

If the lineage is provided at creation time by the respective engine, the refresh operation does not need to parse the SQL, correct?

Thanks,
Walaa.

On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul  wrote:

As the table I created is not properly shown in the mailing list I'll reformat the summary of the different drawbacks again:

Drawbacks of (no lineage, refresh-state key = identifier):
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):
- update of materialized view version required if child view is updated (#5)

On 16.08.24 09:17, Jan Kaul wrote:

Hi,

Thanks Micah for clearly stating the requirements. I think this gives better clarity for the discussion.

It seems like we don't have a solution that satisfies all requirements at once. So we would need to choose which has the fewest drawbacks.

I would like to summarize the different drawbacks that came up in the discussion.

no lineage + refresh-state key = identifier:
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

normalized lineage + refresh-state key = uuid:
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

denormalized lineage + refresh-state key = uuid:
- update of materialized view version required if child view is updated (#5)

With identifiers as the refresh-state keys, the lineage is not strictly required and becomes an orthogonal proposal. That's why I left it out of the comparison.

In my opinion introducing catalog identifiers into the table metadata (requirement #4) is the least significant drawback as it is not a technical reason but more about semantics. Especially as the identifiers are not introduced into the table spec but are rather stored in the snapshot summary. That's why I'm in favor of using the catalog identifiers as the refresh-state keys.

Regarding your last point Walaa:

The option of using catalog identifiers in the state map still requires keeping lineage information in the view because REFRESH MV needs the latest fully expanded children (which could have changed from the set of children currently in the state map), without reparsing the view tree.

For a refresh operation the query engine has to parse the SQL and fully expand the lineage with its children anyway. So the lineage is not strictly required.

If I understand correctly, most of you are also in favor of using catalog identifiers + ref as the refresh-state keys and postponing the lineage proposal.

I hope that we can move the discussion forward.

Jan

On 16.08.24 08:07, Walaa Eldin Moustafa wrote:

The option of using catalog identifiers in the state map still requires keeping lineage information in the view because REFRESH MV needs the latest fully expanded children (which could have changed from the set of children currently in the state map), without reparsing the view tree. Therefore, catalog identifiers in the state map, does not eliminate the need for tracking children in the form of catalog identifiers in the lineage side (but in this case lineage will be a set instead of just a map).

Hence, my concerns with using catalog identifiers (as opposed to UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the view spec

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Ryan Blue
I think Parquet is a better place for the variant spec than Arrow. Parquet
is upstream of nearly every project (other than ORC) so it is a good place
to standardize and facilitate discussions across communities. There are
also existing relationships and connections to the Parquet community
because of its widespread use. Arrow is not used as Spark's in-memory
model, nor Trino and others so those existing relationships aren't there. I
also worry that differences in approaches would make it difficult later on.

This is a fairly specific and orthogonal spec for the in-memory portion so
I don't see much value in maintaining it in the Arrow community. The main
place where it is evolving right now (besides adding type IDs) is in
shredding and storage.


On Fri, Aug 16, 2024 at 9:59 AM Weston Pace  wrote:

> +1 to using Arrow to house the spec.  In the interest of expediency I
> wonder if we could even store it there "on the side" while we figure out
> how to integrate the variant data type with Arrow.
>
> I have a question for those more familiar with the variant spec.  Do we
> think it could be introduced as a canonical extension type?  My thinking
> right now is "no" because there is no storage type that has the concept of
> a "metadata buffer" unless you count dictionary arrays but the metadata
> buffer in dictionary arrays already has a specific meaning and so
> attempting to smuggle variant metadata into it would be odd.  You could
> also use a "variable length binary" with one extra row but that doesn't
> work because then you end up with more rows.  You could very awkwardly
> create a variable length binary array where the first item in the array is
> the metadata buffer + the first variant value but that seems like
> stretching things.
>
> If it's not an extension array then it will need to be a new first-class
> data type / layout and that will take some work & time to achieve support.
> However, I think that may be a prerequisite in some ways anyway.  I believe
> (at least some) Iceberg implementations are using Arrow internally (I could
> be mistaken)?  If that is so then it seems like implementing the variant
> type into Arrow is an inevitability.
>
> On Fri, Aug 16, 2024 at 9:48 AM Gene Pang  wrote:
>
>> Hi all,
>>
>> I am one of the main developers for Variant in Apache Spark. David
>> Cashman (another one of the main Variant developers) and I have been
>> working on Variant in Spark for a while, and we are excited by the interest
>> from the Iceberg community!
>>
>> We have attended some of the Iceberg dev Variant meetings, with Russell
>> and several others here, and it has been a great experience in sharing
>> ideas and getting feedback, especially for the shredding scheme. Maybe
>> there was some sort of miscommunication, but in those meetings we mentioned
>> that we wanted to move the specification and library out to some other
>> location, where it would be easier for other projects to use and
>> collaborate on. I don't know how it ended up that we are not interested in
>> moving them, but I want to correct that misconception, and ensure that we
>> continue to have the goal to host the spec and implementation elsewhere. I
>> haven't heard anything since those meetings, but we have also been trying
>> to figure out some candidate homes.
>>
>> There have been good suggestions and ideas in this thread, and they will
>> be considered as we also coordinate with the Spark and Delta communities,
>> since those projects are already depending on the existing library. I
>> appreciate all the ideas for a potential new home for the project!
>>
>> Thanks,
>> Gene
>>
>> On Thu, Aug 15, 2024 at 4:18 PM Micah Kornfield 
>> wrote:
>>
>>> Thats fair @Micah, so far all the discussions have been direct and off
 the dev list. Would you like to make the request on the public Spark Dev
 list? I would be glad to co-sign, I can also draft up a quick email if you
 don't have time.
>>>
>>>
>>> I think once we come to consensus, if you have bandwidth, I think the
>>> message might be better coming from you, as you have more context on some
>>> of the non-public conversations, the requirements from an Iceberg
>>> perspective on governance and the blockers that were encountered.  If
>>> details on the conversations can't be shared, (i.e. we are starting from
>>> scratch) it seems like suggesting a new project via SPIP might be the way
>>> forward.  I'm happy to help with that if it is useful but I would guess
>>> Aihua or Tyler might be in a better place to start as it seems they have
>>> done more serious thinking here.
>>>
>>> If we decide to try to standardize on Parquet or Arrow I'm happy to help
>>> support the effort in those communities.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 Thats fair @Micah, so far all the discussions have been direct and off
 the dev list. Would you like to make the request on the public S

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Weston Pace
> Parquet is upstream of nearly every project (other than ORC)

I disagree with this statement.  There is a difference between being
upstream and being the internal format in use.  For example, datafusion,
duckdb, ray, etc. all have parquet upstream but all of them use Arrow as
the internal memory model.  These things will (presumably) want to start
working with variant data and so variant arrays will end up in Arrow at
some point.

That being said, I agree that the shredding / storage extensions don't
really make much sense in Arrow.  I suspect, like many other things, there
will be storage oriented versions and in-memory oriented versions (e.g. REE
is only newly added to arrow and bitpacking has not yet been added because
they tend to be more storage-oriented).

I guess I don't really see the problem in both Arrow and Parquet having a
variant spec.  Both impls already have their own type systems so we are
just perpetuating an existing problem.  I imagine an in-memory variant
might have a richer set of type codes (e.g. parameterized types could be
fully handled by dictionary encoding the parameterized types used by an
array in the metadata buffer.  This makes more sense in-memory where
compression is not as vital as flexibility) so that there can be 1:1
compatibility with the non-variant type system (e.g. storing
fixed-size-binary in a variant, or extension types like UUID).  Then, when
storing in-memory data, Arrow/Parquet bridges will have to make decisions
on how to encode its types into parquet types (we already need to do this
since the type systems are not fully compatible). Though I can't see off
the top of my head how exactly that will happen, I'm sure we'll figure
something out (e.g. we could replace the field name with a placeholder field
"__placeholder_0" and then store a table in the file metadata, e.g.
__placeholder_0=my_uuid_col,fixed_size_list<16>, to look up the true types).
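As a rough illustration of the placeholder idea sketched above (the mapping and helper are invented for the example):

# Hypothetical illustration: real field names/types are replaced by
# placeholders, and a small lookup table in the file metadata records the
# true types so a reader can restore them.
placeholder_map = {
    "__placeholder_0": ("my_uuid_col", "fixed_size_list<16>"),
}

def resolve(field_name):
    # Fall back to the field name itself when no placeholder is registered.
    return placeholder_map.get(field_name, (field_name, None))

print(resolve("__placeholder_0"))  # ('my_uuid_col', 'fixed_size_list<16>')
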

On Fri, Aug 16, 2024 at 11:17 AM Ryan Blue 
wrote:

> I think Parquet is a better place for the variant spec than Arrow. Parquet
> is upstream of nearly every project (other than ORC) so it is a good place
> to standardize and facilitate discussions across communities. There are
> also existing relationships and connections to the Parquet community
> because of its widespread use. Arrow is not used as Spark's in-memory
> model, nor Trino and others so those existing relationships aren't there. I
> also worry that differences in approaches would make it difficult later on.
>
> This is a fairly specific and orthogonal spec for the in-memory portion so
> I don't see much value in maintaining it in the Arrow community. The main
> place where it is evolving right now (besides adding type IDs) is in
> shredding and storage.
>
>
> On Fri, Aug 16, 2024 at 9:59 AM Weston Pace  wrote:
>
>> +1 to using Arrow to house the spec.  In the interest of expediency I
>> wonder if we could even store it there "on the side" while we figure out
>> how to integrate the variant data type with Arrow.
>>
>> I have a question for those more familiar with the variant spec.  Do we
>> think it could be introduced as a canonical extension type?  My thinking
>> right now is "no" because there is no storage type that has the concept of
>> a "metadata buffer" unless you count dictionary arrays but the metadata
>> buffer in dictionary arrays already has a specific meaning and so
>> attempting to smuggle variant metadata into it would be odd.  You could
>> also use a "variable length binary" with one extra row but that doesn't
>> work because then you end up with more rows.  You could very awkwardly
>> create a variable length binary array where the first item in the array is
>> the metadata buffer + the first variant value but that seems like
>> stretching things.
>>
>> If it's not an extension array then it will need to be a new first-class
>> data type / layout and that will take some work & time to achieve support.
>> However, I think that may be a prerequisite in some ways anyway.  I believe
>> (at least some) Iceberg implementations are using Arrow internally (I could
>> be mistaken)?  If that is so then it seems like implementing the variant
>> type into Arrow is an inevitability.
>>
>> On Fri, Aug 16, 2024 at 9:48 AM Gene Pang  wrote:
>>
>>> Hi all,
>>>
>>> I am one of the main developers for Variant in Apache Spark. David
>>> Cashman (another one of the main Variant developers) and I have been
>>> working on Variant in Spark for a while, and we are excited by the interest
>>> from the Iceberg community!
>>>
>>> We have attended some of the Iceberg dev Variant meetings, with Russell
>>> and several others here, and it has been a great experience in sharing
>>> ideas and getting feedback, especially for the shredding scheme. Maybe
>>> there was some sort of miscommunication, but in those meetings we mentioned
>>> that we wanted to move the specification and library out to some other
>>> location, where it would be easier for other projects to use and

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Gene Pang
I think Parquet might be a better home over Arrow. Ryan already brought up
interesting points, especially with all of the storage related details and
discussions, like shredding.

Another aspect to this is that while working on Variant, we had ideas of
adding a Variant logical type to Parquet. We thought Variant could make a
lot of sense as a Parquet type, to make it more convenient to implement
Variant in other projects. We never got that started since we were busy
building out the features and library, but maybe if there are more people
from the various communities, that could become a reality sooner.

Thanks,
Gene

On Fri, Aug 16, 2024 at 11:18 AM Ryan Blue 
wrote:

> I think Parquet is a better place for the variant spec than Arrow. Parquet
> is upstream of nearly every project (other than ORC) so it is a good place
> to standardize and facilitate discussions across communities. There are
> also existing relationships and connections to the Parquet community
> because of its widespread use. Arrow is not used as Spark's in-memory
> model, nor Trino and others so those existing relationships aren't there. I
> also worry that differences in approaches would make it difficult later on.
>
> This is a fairly specific and orthogonal spec for the in-memory portion so
> I don't see much value in maintaining it in the Arrow community. The main
> place where it is evolving right now (besides adding type IDs) is in
> shredding and storage.
>
>
> On Fri, Aug 16, 2024 at 9:59 AM Weston Pace  wrote:
>
>> +1 to using Arrow to house the spec.  In the interest of expediency I
>> wonder if we could even store it there "on the side" while we figure out
>> how to integrate the variant data type with Arrow.
>>
>> I have a question for those more familiar with the variant spec.  Do we
>> think it could be introduced as a canonical extension type?  My thinking
>> right now is "no" because there is no storage type that has the concept of
>> a "metadata buffer" unless you count dictionary arrays but the metadata
>> buffer in dictionary arrays already has a specific meaning and so
>> attempting to smuggle variant metadata into it would be odd.  You could
>> also use a "variable length binary" with one extra row but that doesn't
>> work because then you end up with more rows.  You could very awkwardly
>> create a variable length binary array where the first item in the array is
>> the metadata buffer + the first variant value but that seems like
>> stretching things.
>>
>> If it's not an extension array then it will need to be a new first-class
>> data type / layout and that will take some work & time to achieve support.
>> However, I think that may be a prerequisite in some ways anyway.  I believe
>> (at least some) Iceberg implementations are using Arrow internally (I could
>> be mistaken)?  If that is so then it seems like implementing the variant
>> type into Arrow is an inevitability.
>>
>> On Fri, Aug 16, 2024 at 9:48 AM Gene Pang  wrote:
>>
>>> Hi all,
>>>
>>> I am one of the main developers for Variant in Apache Spark. David
>>> Cashman (another one of the main Variant developers) and I have been
>>> working on Variant in Spark for a while, and we are excited by the interest
>>> from the Iceberg community!
>>>
>>> We have attended some of the Iceberg dev Variant meetings, with Russell
>>> and several others here, and it has been a great experience in sharing
>>> ideas and getting feedback, especially for the shredding scheme. Maybe
>>> there was some sort of miscommunication, but in those meetings we mentioned
>>> that we wanted to move the specification and library out to some other
>>> location, where it would be easier for other projects to use and
>>> collaborate on. I don't know how it ended up that we are not interested in
>>> moving them, but I want to correct that misconception, and ensure that we
>>> continue to have the goal to host the spec and implementation elsewhere. I
>>> haven't heard anything since those meetings, but we have also been trying
>>> to figure out some candidate homes.
>>>
>>> There have been good suggestions and ideas in this thread, and they will
>>> be considered as we also coordinate with the Spark and Delta communities,
>>> since those projects are already depending on the existing library. I
>>> appreciate all the ideas for a potential new home for the project!
>>>
>>> Thanks,
>>> Gene
>>>
>>> On Thu, Aug 15, 2024 at 4:18 PM Micah Kornfield 
>>> wrote:
>>>
 Thats fair @Micah, so far all the discussions have been direct and off
> the dev list. Would you like to make the request on the public Spark Dev
> list? I would be glad to co-sign, I can also draft up a quick email if you
> don't have time.


 I think once we come to consensus, if you have bandwidth, I think the
 message might be better coming from you, as you have more context on some
 of the non-public conversations, the requirements from an Iceberg
 perspective on governance and the blocker

Re: [EXTERNAL] Re: Iceberg-arrow vectorized read bug

2024-08-16 Thread Lessard, Steve
Hi Eduard,

Thank you for offering to help with this issue. If you are able to find some 
time to look at this issue I’d be sure to make some time to collaborate with 
you.

I have been continuing to investigate the issue. I still do not have a correct 
solution, but I believe I have something closer to a workable solution. I built 
upon the two pull requests I shared in my previous email and came up with the 
solution in https://github.com/apache/iceberg/pull/10953. I annotated that pull 
request to point out areas I know are problematic.

-Steve Lessard, Teradata.




From: Eduard Tudenhöfner 
Date: Friday, August 16, 2024 at 2:45 AM
To: dev@iceberg.apache.org 
Subject: [EXTERNAL] Re: Iceberg-arrow vectorized read bug
[CAUTION: External Email]

Hey Steve,

It's been a long time since I did some work on the iceberg-arrow module but I 
will try to find some time next week to analyze the problem in detail and see 
what options we have for fixing it.

Thanks for your patience here.

Eduard

On Mon, Aug 12, 2024 at 9:00 PM Lessard, Steve 
 wrote:

Iceberg-arrow is a non-trivial chunk of code that appears to me as though it 
has no current owner. Based on the history I find in GitHub it appears the 
original author of the org.apache.iceberg.arrow.vectorized package, Mayur 
Srivastava, wrote this package as his one and only contribution to any public 
repository in GitHub.com and hasn’t been seen since.  That was back in April of 
2021.
 Now I need to fix a bug in this package and cannot find anyone to mentor me 
through changing this complex package.



I’ve taken yet another look at this bug in the vectorized Arrow reader code, 
issue #10275. As a summary, the 
issue is that vectorized reads fail when the parquet file does not contain a 
column that exists within the Iceberg schema. This is a schema evolution issue. 
I spent several hours studying the code, learning where and how readers for 
each column are created, how the vector holder for each column is produced, and 
how the value for each column is consumed via an ArrowVectorAccessor instance. 
I have concluded that this isn’t really a bug so much as a never implemented 
feature. There are two main reasons for my conclusion:

1.  The code currently throws a NullPointerException when trying to read a 
column that exists in the Iceberg schema but does not exist in the parquet 
schema

2.  This comment in the source code for ArrowBatchReader.java, line 53: “// Handle null vector for constant case.” I believe this is a TODO comment.
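As a rough illustration of what handling the "null vector for constant case" would have to do for a column that is in the Iceberg schema but absent from the data file (this is a pyarrow sketch of the general schema-evolution behavior, not the iceberg-arrow API):

import pyarrow as pa

# Sketch only: a column present in the table schema but missing from the
# Parquet file is filled with an all-null vector of the expected length and
# type instead of failing with a NullPointerException.
file_batch = pa.record_batch([pa.array([1, 2, 3], type=pa.int64())], names=["id"])
missing_col = pa.nulls(file_batch.num_rows, type=pa.string())
evolved = pa.record_batch([file_batch.column(0), missing_col], names=["id", "new_col"])
print(evolved.to_pydict())  # {'id': [1, 2, 3], 'new_col': [None, None, None]}
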



So, what’s the path forward? At a minimum, and as a first step, I believe my 
original PR for this issue should be approved and merged. That PR is 
https://github.com/apache/iceberg/pull/10284. That PR simply prevents throwing 
a NullPointerException and instead throws an UnsupportedOperationException 
(which is the same exception thrown as the default case for trying to read any 
unsupported column type.)



As I wrote earlier that could be just a first step. What could be the next 
step? Well, I drafted two hacks of the source code to help me explore and 
hopefully inform a potential solution. I want to emphasize that the pull 
requests are hacks and not intended nor proposed to be merged into Apache 
Iceberg.



First, I wrote this hack which explores adding a reader for the missing column. 
https://github.com/apache/iceberg/pull/10922 Ultimately this solution failed 
because, though I hacked together a reader, there was nothing to read inside 
the parquet file. The meat of this change is in 
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java.



Second, I wrote this hack, which changes the dummy holder currently used for 
null values. https://github.com/apache/iceberg/pull/10923 This hack ultimately 
fails for two reasons: 1) The VectorHolder instance I created is hacked 
together with hard coded information that just isn’t available at this point in 
the call stack. 2) I don’t know how to make anything consume the NullAccessor 
class I created to access the null value from the NullVector instance.



I am hoping two things will come out of this email message:

1.  My original pull request will be approved and merged. 
https://github.com/apache/iceberg/pull/10284

2.  One or both of my hacks will inform / inspire someone on this list to 
dream up a better solution and mentor me so that I may code up that solution 
and further explore it.











From: Lessard, Steve 
Date: Thursday, August 1, 2024 at 4:38 PM
To: dev@iceberg.apache.org, Amogh Jahagirdar <2am...@gmail.com>
Cc: dev@iceberg.apache.or

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Reynold Xin
My $0.02 (as an Apache Spark PMC member):

It'd be very unfortunate if there emerges multiple variant specs at the
physical storage layer. The most important thing is interoperability at the
physical storage layer, since that's by far the most expensive to
"convert". Forking will inevitably lead to divergence (otherwise why fork?).

Parquet would be a great place to host the variant spec. I always view
Parquet as the de facto standard for data lake storage (since its adoption
is far, far larger than any other things these days, despite its
shortcomings). And of course, Spark, and virtually every other project,
already depends on Parquet.

When I was first invited to join the Arrow PMC (and regrettably I haven't
contributed any real code directly to Arrow itself), my understanding was
that Arrow was meant for a fast exchange format, and I don't think that has
really changed much today. Despite the massive scope increase of Arrow over
the years (with a lot of different new modules, sub projects), the most
successful one is still the small kernel for fast data exchange.


On Fri, Aug 16, 2024 at 9:00 PM Gene Pang  wrote:

> I think Parquet might be a better home over Arrow. Ryan already brought up
> interesting points, especially with all of the storage related details and
> discussions, like shredding.
>
> Another aspect to this is that while working on Variant, we had ideas of
> adding a Variant logical type to Parquet. We thought Variant could make a
> lot of sense as a Parquet type, to make it more convenient to implement
> Variant in other projects. We never got that started since we were busy
> building out the features and library, but maybe if there are more people
> from the various communities, that could become a reality sooner.
>
> Thanks,
> Gene
>
> On Fri, Aug 16, 2024 at 11:18 AM Ryan Blue 
> wrote:
>
>> I think Parquet is a better place for the variant spec than Arrow.
>> Parquet is upstream of nearly every project (other than ORC) so it is a
>> good place to standardize and facilitate discussions across communities.
>> There are also existing relationships and connections to the Parquet
>> community because of its widespread use. Arrow is not used as Spark's
>> in-memory model, nor Trino and others so those existing relationships
>> aren't there. I also worry that differences in approaches would make it
>> difficult later on.
>>
>> This is a fairly specific and orthogonal spec for the in-memory portion
>> so I don't see much value in maintaining it in the Arrow community. The
>> main place where it is evolving right now (besides adding type IDs) is in
>> shredding and storage.
>>
>>
>> On Fri, Aug 16, 2024 at 9:59 AM Weston Pace 
>> wrote:
>>
>>> +1 to using Arrow to house the spec.  In the interest of expediency I
>>> wonder if we could even store it there "on the side" while we figure out
>>> how to integrate the variant data type with Arrow.
>>>
>>> I have a question for those more familiar with the variant spec.  Do we
>>> think it could be introduced as a canonical extension type?  My thinking
>>> right now is "no" because there is no storage type that has the concept of
>>> a "metadata buffer" unless you count dictionary arrays but the metadata
>>> buffer in dictionary arrays already has a specific meaning and so
>>> attempting to smuggle variant metadata into it would be odd.  You could
>>> also use a "variable length binary" with one extra row but that doesn't
>>> work because then you end up with more rows.  You could very awkwardly
>>> create a variable length binary array where the first item in the array is
>>> the metadata buffer + the first variant value but that seems like
>>> stretching things.
>>>
>>> If it's not an extension array then it will need to be a new first-class
>>> data type / layout and that will take some work & time to achieve support.
>>> However, I think that may be a prerequisite in some ways anyway.  I believe
>>> (at least some) Iceberg implementations are using Arrow internally (I could
>>> be mistaken)?  If that is so then it seems like implementing the variant
>>> type into Arrow is an inevitability.
>>>
>>> On Fri, Aug 16, 2024 at 9:48 AM Gene Pang  wrote:
>>>
 Hi all,

 I am one of the main developers for Variant in Apache Spark. David
 Cashman (another one of the main Variant developers) and I have been
 working on Variant in Spark for a while, and we are excited by the interest
 from the Iceberg community!

 We have attended some of the Iceberg dev Variant meetings, with Russell
 and several others here, and it has been a great experience in sharing
 ideas and getting feedback, especially for the shredding scheme. Maybe
 there was some sort of miscommunication, but in those meetings we mentioned
 that we wanted to move the specification and library out to some other
 location, where it would be easier for other projects to use and
 collaborate on. I don't know how it ended up that we are not in

Type promotion in v3

2024-08-16 Thread Ryan Blue
I’ve recently been working on updating the spec for new types and type
promotion cases in v3.

I was talking to Micah and he pointed out an issue with type promotion: the
upper and lower bounds for data file columns that are kept in Avro
manifests don’t have any information about the type that was used to encode
the bounds.

For example, when writing to a table with a float column, 4: f, the
manifest’s lower_bounds and upper_bounds maps will have an entry with the
field ID (4) as the key and a 4-byte encoded float for the value. If column f
were later promoted to double, those maps aren’t changed. The way we
currently detect that the type was promoted is to check the binary value
and read it as a float if there are 4 bytes instead of 8. This prevents us
from adding int to double type promotion because when there are 4 bytes we
would not know whether the value was originally an int or a float.
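To make the ambiguity concrete, here is a small sketch (illustrative only, not Iceberg's actual code) of the length-based heuristic described above; with only float-to-double promotion allowed, a 4-byte bound under a double column can safely be decoded as a float, but allowing int-to-double as well would make the 4-byte case ambiguous:

import struct

def read_double_bound(raw: bytes) -> float:
    # Sketch of the current heuristic: bounds keep the width of the type they
    # were written with, so 4 bytes implies "promoted from float" and 8 bytes
    # implies the column was already written as a double.
    if len(raw) == 4:
        return struct.unpack("<f", raw)[0]
    return struct.unpack("<d", raw)[0]

print(read_double_bound(struct.pack("<f", 1.5)))  # 1.5 (written as float)
print(read_double_bound(struct.pack("<d", 2.5)))  # 2.5 (written as double)
# If int -> double were also allowed, a 4-byte bound could be either an int32
# or a float32, and the length alone could not distinguish them.
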

Several of the type promotion cases from my previous email hit this
problem. Date/time types to string, int and long to string, and long to
timestamp are all affected. I think the best path forward is to add fewer
type promotion cases to v3 and support only these new cases:

   - int and long to string
   - date to timestamp
   - null/unknown to any
   - any to variant (if supported by the Variant spec)

That list would allow us to keep using the current strategy and not add new
metadata to track the type to our manifests. My rationale for not adding
new information to track the bound types at the time that the data file
metadata is created is that it would inflate the size of manifests and push
out the timeline for getting v3 done. Many of us would like to get v3
released to get the timestamp_ns and variant types out. And if we can get
at least some of the promotion cases out that’s better.

To address type promotion in the long term, I think that we should consider
moving to Parquet manifests. This has been suggested a few times so that we
can project just the lower and upper bounds that are needed for scan
planning. That would also fix type promotion because the manifest file
schema would include full type information for the stats columns. Given the
complexity of releasing Parquet manifests, I think it makes more sense to
get a few promotion cases done now in v3 and follow up with the rest in v4.

Ryan

-- 
Ryan Blue


Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Will Jones
In being more engine and format agnostic, I agree the Arrow project might
be a good host for such a specification. It seems like we want to move away
from hosting in Spark to make it engine agnostic. But moving into Iceberg
might make it less format agnostic, as I understand multiple formats might
want to implement this. I'm not intimately familiar with the state of this,
but I believe Delta Lake would like to be aligned with the same format as
Iceberg. In addition, the Lance format (which I work on) will eventually
be interesting as well. It seems equally bad to me to attach this
specification to a particular table format as it does a particular query
engine.

That being said, I think the most important consideration for now is where
are the current maintainers / contributors to the variant type. If most of
them are already PMC members / committers on a project, it becomes a bit
easier. Otherwise if there isn't much overlap with a project's existing
governance, I worry there could be a bit of friction. How many active
contributors are there from Iceberg? And how about from Arrow?

BTW, I'd add I'm interested in helping develop an Arrow extension type for
the binary variant type. I've been experimenting with a DataFusion
extension that operates on this [1], and already have some ideas on how
such an extension type might be defined. I'm not yet caught up on the
shredded specification, but I think having just the binary format would be
beneficial for in-memory analytics, which are most relevant to Arrow. I'll
be creating a separate thread on the Arrow ML about this soon.

Best,

Will Jones

[1]
https://github.com/datafusion-contrib/datafusion-functions-variant/issues


On Thu, Aug 15, 2024 at 7:39 PM Gang Wu  wrote:

> + dev@arrow
>
> Thanks for all the valuable suggestions! I am inclined to Micah's idea that
> Arrow might be a better host compared to Parquet.
>
> To give more context, I am taking the initiative to add the geometry type
> to both Parquet and ORC. I'd like to do the same thing for variant type in
> that variant type is engine and file format agnostic. This does mean that
> Parquet might not be the neutral place to hold the variant spec.
>
> Best,
> Gang
>
> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li 
> wrote:
>
> > Thanks all for your discussion.
> >
> > The Apache Paimon community is also considering support for this
> > Variant type, without a doubt, we hope to maintain consistency with
> > Iceberg.
> >
> > Not only the Paimon community, but also various computing engines need
> > to adapt to this type, such as Flink and StarRocks. We also hope to
> > promote them to adapt to this type.
> >
> > It is worth noting that we also need to standardize many functions
> > related to it.
> >
> > A neutral place to maintain it is a great choice.
> >
> > - As Gang Wu said, a standalone project is good, just like RoaringBitmap
> > [1].
> > - As Ryan said, Parquet community is a neutral option too.
> > - As Micah said, Arrow is also an option too.
> >
> > [1] https://github.com/RoaringBitmap
> >
> > Best,
> > Jingsong
> >
> > On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield 
> > wrote:
> > >>
> > >> Thats fair @Micah, so far all the discussions have been direct and off
> > the dev list. Would you like to make the request on the public Spark Dev
> > list? I would be glad to co-sign, I can also draft up a quick email if
> you
> > don't have time.
> > >
> > >
> > > I think once we come to consensus, if you have bandwidth, I think the
> > message might be better coming from you, as you have more context on some
> > of the non-public conversations, the requirements from an Iceberg
> > perspective on governance and the blockers that were encountered.  If
> > details on the conversations can't be shared, (i.e. we are starting from
> > scratch) it seems like suggesting a new project via SPIP might be the way
> > forward.  I'm happy to help with that if it is useful but I would guess
> > Aihua or Tyler might be in a better place to start as it seems they have
> > done more serious thinking here.
> > >
> > > If we decide to try to standardize on Parquet or Arrow I'm happy to
> help
> > support the effort in those communities.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
> > russell.spit...@gmail.com> wrote:
> > >>
> > >> Thats fair @Micah, so far all the discussions have been direct and off
> > the dev list. Would you like to make the request on the public Spark Dev
> > list? I would be glad to co-sign, I can also draft up a quick email if
> you
> > don't have time.
> > >>
> > >> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > wrote:
> > 
> >  I agree that it would be beneficial to make a sub-project, the main
> > problem is political and not logistic. I've been asking for movement from
> > other relative projects for a month and we simply haven't gotten
> anywhere.
> > >>>
> > >>>
> > >>> I just wanted to double check that

[DISCUSS] Row Lineage Proposal

2024-08-16 Thread Russell Spitzer
Hi Y'all,

We've been working on a new proposal to add Row Lineage to Iceberg in the
V3 Spec. The general idea is to give every row a unique identifier as well
as a marker of what version of the row it is. This should let us build a
variety of features related to CDC, Incremental Processing and Audit
Logging. If you are interested please check out the linked proposal below.
This will require compliance from all engines to be really useful so It's
important we come to consensus on whether or not this is possible.

https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
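Purely to illustrate the idea (the field names below are invented for this example and are not taken from the proposal), per-row lineage amounts to carrying an identifier plus a version marker with every row:

# Hypothetical illustration only; the actual fields and semantics are defined
# in the linked proposal.
rows = [
    {"row_id": 1001, "row_version": 1, "data": "original value"},
    {"row_id": 1001, "row_version": 2, "data": "updated value"},  # same row, newer version
    {"row_id": 1002, "row_version": 1, "data": "another row"},
]

# A CDC or incremental-processing consumer can then pick the latest version
# of each row by its identifier.
latest = {}
for r in rows:
    current = latest.get(r["row_id"])
    if current is None or r["row_version"] > current["row_version"]:
        latest[r["row_id"]] = r
print(latest[1001]["data"])  # 'updated value'
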


Thank you for your consideration,
Russ


Re: [VOTE] Merge REST spec clarification on how servers should handle unknown updates/requirements

2024-08-16 Thread Amogh Jahagirdar
The vote passes with 7 +1 binding votes and 1 +1 non-binding vote, and I will
merge the spec change.
Thanks everyone for providing feedback and voting!

Thanks,
Amogh Jahagirdar


On Wed, Aug 14, 2024 at 9:23 PM Renjie Liu  wrote:

> +1
>
> On Thu, Aug 15, 2024 at 10:10 AM Jack Ye  wrote:
>
>> +1
>>
>> -Jack
>>
>> On Wed, Aug 14, 2024 at 8:39 AM Daniel Weeks  wrote:
>>
>>> +1
>>>
>>> On Tue, Aug 13, 2024 at 7:34 PM xianjin  wrote:
>>>
 +1

 On Aug 14, 2024, at 2:24 AM, Ryan Blue 
 wrote:

 
 +1

 On Tue, Aug 13, 2024 at 8:59 AM Yufei Gu  wrote:

> +1
> Yufei
>
>
> On Tue, Aug 13, 2024 at 8:57 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2024 at 5:09 PM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> I've opened a PR [1] to clarify in the REST spec that if a server
>>> receives an unknown update or requirement as part of any the commit
>>> endpoints, that the server must fail with a 400 error response.
>>>
>>> Please vote on merging this change.  The vote will remain open for
>>> at least 72 hours.
>>>
>>> [] +1
>>> [] +0
>>> [] -1, do not merge because ...
>>>
>>> [1] https://github.com/apache/iceberg/pull/10848
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>

 --
 Ryan Blue
 Databricks




Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
That is right. I agree that in the case of using catalog identifiers in
state information, using them in lineage information would be a
nice-to-have and not a requirement.

However, this still does not address the semantic issue which is more
fundamental in my opinion. The Iceberg table spec is not aware of catalog
table identifiers and this use will be the first break of this abstraction.

On a side note, it does not address the denormalization issue either if we
ever want to introduce the lineage in the view as a nice-to-have.

Thanks,
Walaa.


On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul 
wrote:

> Hi Walaa,
>
> I would argue that for the refresh operation the query engine has to parse
> the query and then somehow execute it. For a full refresh it will directly
> execute the query and for a incremental refresh it will execute a modified
> version. Therefore it has to fully expand the query tree.
>
> Best wishes,
>
> Jan
>
> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa :
>
> Thanks Jan for the summary.
>
> For this point:
>
> > For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with it's children anyway.  So the lineage is not
> strictly required.
>
> If the lineage is provided at creation time by the respective engine, the
> refresh operation does not need to parse the SQL, correct?
>
> Thanks,
> Walaa.
>
> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
> wrote:
>
> As the table I created is not properly shown in the mailing list I'll
> reformat the summary of the different drawbacks again:
>
> Drawbacks of (no lineage, refresh-state key = identifier):
>
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> Drawbacks of (normalized lineage, refresh-state key = uuid):
>
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>
> - update of materialized view version required if child view is updated
> (#5)
> On 16.08.24 09:17, Jan Kaul wrote:
>
> Hi,
>
> Thanks Micah for clearly stating the requirements. I think this gives
> better clarity for the discussion.
>
> It seems like we don't have a solution that satisfies all requirements at
> once. So we would need to choose which has the fewest drawbacks.
>
> I would like to summarize the different drawbacks that came up in the
> discussion.
> no lineage + refresh-state key = identifier:
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> normalized lineage + refresh-state key = uuid:
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> denormalized lineage + refresh-state key = uuid:
> - update of materialized view version required if child view is updated (#5)
>
> With identifiers as the refresh-state keys, the lineage is not strictly
> required and becomes an orthogonal proposal. That's why I left it out if
> the comparison.
>
> In my opinion introducing catalog identifiers into the table metadata
> (requirement #4) is the least significant drawback as it is not a technical
> reason but more about semantics. Especially as the identifiers are not
> introduced into the table spec but are rather stored in the snapshot
> summary. That's why I'm in favor of using the catalog identifiers as the
> refresh-state keys.
>
> Regarding your last point Walaa:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree.
>
> For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with it's children anyway.  So the lineage is not
> strictly required.
>
> If I understand correctly, most of you are also in favor of using catalog
> identifiers + ref as the refresh-state keys and postponing the lineage
> proposal.
>
> I hope that we can move the discussion forward.
>
> Jan
> On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree. Therefore,
> catalog identifiers in the state map, does not eliminate the need for
> tracking children in the form of catalog identifiers in the lineage side
> (but in this case lineage will be a set instead of just a map).
>
> Hence, my concerns with using catalog identifiers (as opposed to UUIDs)
> are:
> * The fundamental issue where the table spec depends on/refers to the view
> spec (be

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Micah Kornfield
>
> However, this still does not address the semantic issue which is more
> fundamental in my opinion. The Iceberg table spec is not aware of catalog
> table identifiers and this use will be the first break of this abstraction.


IIUC, based on Jan's comments, we are not going to modify the table
specification.  I thought the state map was effectively opaque metadata
from the table specification perspective?  If this is the case I feel like
that is OK and not a blocker, I think by their nature as already discussed
MVs need catalog information to function properly and the choice to put
catalog information into the table metadata is pragmatic and preserves
other desirable properties.  It might be a more important point if we want
to update the table specification (IMO, I still think it would probably be
OK).

On a side note, it does not address the denormalization issue either if we
> ever want to introduce the lineage in the view as a nice-to-have.


I think if lineage is introduced to the View metadata, it should only hold
direct dependencies for the reasons already discussed. IMO, I think the
potential overlap is OK as they serve two different purposes.

Cheers,
Micah
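To keep the option being discussed concrete, here is a hypothetical sketch of a refresh-state map keyed by catalog identifier + ref and carried as opaque snapshot-summary metadata; the property name, key format, and values are invented for illustration and are not part of any spec:

import json

# Hypothetical illustration only: a refresh-state map keyed by
# "catalog identifier + ref", stored as an opaque snapshot summary property
# of the materialized view's storage table. All names are made up.
refresh_state = {
    "prod_catalog.db.orders@main": 5231658573191045708,
    "prod_catalog.db.customers@main": 1257641798450865621,
}

# From the table spec's point of view this is just another summary string;
# only MV-aware engines interpret the keys as catalog identifiers.
snapshot_summary = {"materialized-view.refresh-state": json.dumps(refresh_state)}
print(snapshot_summary["materialized-view.refresh-state"])
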





On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa 
wrote:

> That is right. I agree that in the case of using catalog identifiers in
> state information, using them in lineage information would be a
> nice-to-have and not a requirement.
>
> However, this still does not address the semantic issue which is more
> fundamental in my opinion. The Iceberg table spec is not aware of catalog
> table identifiers and this use will be the first break of this abstraction.
>
> On a side note, it does not address the denormalization issue either if we
> ever want to introduce the lineage in the view as a nice-to-have.
>
> Thanks,
> Walaa.
>
>
> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul 
> wrote:
>
>> Hi Walaa,
>>
>> I would argue that for the refresh operation the query engine has to
>> parse the query and then somehow execute it. For a full refresh it will
>> directly execute the query and for a incremental refresh it will execute a
>> modified version. Therefore it has to fully expand the query tree.
>>
>> Best wishes,
>>
>> Jan
>>
>> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa :
>>
>> Thanks Jan for the summary.
>>
>> For this point:
>>
>> > For a refresh operation the query engine has to parse the SQL and fully
>> expand the lineage with it's children anyway.  So the lineage is not
>> strictly required.
>>
>> If the lineage is provided at creation time by the respective engine, the
>> refresh operation does not need to parse the SQL, correct?
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
>> wrote:
>>
>> As the table I created is not properly shown in the mailing list I'll
>> reformat the summary of the different drawbacks again:
>>
>> Drawbacks of (no lineage, refresh-state key = identifier):
>>
>> - introduces catalog identifiers into table metadata (#4)
>> - query engine has to expand lineage at refresh time (happens anyway)
>>
>> Drawbacks of (normalized lineage, refresh-state key = uuid):
>>
>> - recursive calls to catalog to expand lineage at read time (#2)
>> - fragile by requiring child views to have lineage field
>>
>> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>>
>> - update of materialized view version required if child view is updated
>> (#5)
>> On 16.08.24 09:17, Jan Kaul wrote:
>>
>> Hi,
>>
>> Thanks Micah for clearly stating the requirements. I think this gives
>> better clarity for the discussion.
>>
>> It seems like we don't have a solution that satisfies all requirements at
>> once. So we would need to choose which has the fewest drawbacks.
>>
>> I would like to summarize the different drawbacks that came up in the
>> discussion.
>> no lineage + refresh-state key = identifier:
>> - introduces catalog identifiers into table metadata (#4)
>> - query engine has to expand lineage at refresh time (happens anyway)
>>
>> normalized lineage + refresh-state key = uuid:
>> - recursive calls to catalog to expand lineage at read time (#2)
>> - fragile by requiring child views to have lineage field
>>
>> denormalized lineage + refresh-state key = uuid:
>> - update of materialized view version required if child view is updated (#5)
>>
>> With identifiers as the refresh-state keys, the lineage is not strictly
>> required and becomes an orthogonal proposal. That's why I left it out if
>> the comparison.
>>
>> In my opinion introducing catalog identifiers into the table metadata
>> (requirement #4) is the least significant drawback as it is not a technical
>> reason but more about semantics. Especially as the identifiers are not
>> introduced into the table spec but are rather stored in the snapshot
>> summary. That's why I'm in favor of using the catalog identifiers as the
>> refresh-state keys.
>>
>> Regarding your last point Walaa:
>>
>> The option of using catalog identifiers in the st

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Micah, for the latter, I meant the type of denormalization of
repeating a 3-part name as opposed to using an ID.

On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield 
wrote:

> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>> table identifiers and this use will be the first break of this abstraction.
>
>
> IIUC, based on Jan's comments, we are not going to modify the table
> specification.  I thought the state map was effectively opaque metadata
> from the table specification perspective?  If this is the case I feel like
> that is OK and not a blocker, I think by their nature as already discussed
> MVs need catalog information to function properly and the choice to put
> catalog information into the table metadata is pragmatic and preserves
> other desirable properties.  It might be a more important point if we want
> to update the table specification (IMO, I still think it would probably be
> OK).
>
> On a side note, it does not address the denormalization issue either if we
>> ever want to introduce the lineage in the view as a nice-to-have.
>
>
> I think if lineage is introduced to the View metadata, it should only hold
> direct dependencies for the reasons already discussed. IMO, I think the
> potential overlap is OK as they serve two different purposes.
>
> Cheers,
> Micah
>
>
>
>
>
> On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> That is right. I agree that in the case of using catalog identifiers in
>> state information, using them in lineage information would be a
>> nice-to-have and not a requirement.
>>
>> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>> table identifiers and this use will be the first break of this abstraction.
>>
>> On a side note, it does not address the denormalization issue either if
>> we ever want to introduce the lineage in the view as a nice-to-have.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul 
>> wrote:
>>
>>> Hi Walaa,
>>>
>>> I would argue that for the refresh operation the query engine has to
>>> parse the query and then somehow execute it. For a full refresh it will
>>> directly execute the query and for a incremental refresh it will execute a
>>> modified version. Therefore it has to fully expand the query tree.
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa >> >:
>>>
>>> Thanks Jan for the summary.
>>>
>>> For this point:
>>>
>>> > For a refresh operation the query engine has to parse the SQL and
>>> fully expand the lineage with it's children anyway.  So the lineage is not
>>> strictly required.
>>>
>>> If the lineage is provided at creation time by the respective engine,
>>> the refresh operation does not need to parse the SQL, correct?
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
>>> wrote:
>>>
>>> As the table I created is not properly shown in the mailing list, I'll
>>> reformat the summary of the different drawbacks again:
>>>
>>> Drawbacks of (no lineage, refresh-state key = identifier):
>>>
>>> - introduces catalog identifiers into table metadata (#4)
>>> - query engine has to expand lineage at refresh time (happens anyway)
>>>
>>> Drawbacks of (normalized lineage, refresh-state key = uuid):
>>>
>>> - recursive calls to catalog to expand lineage at read time (#2)
>>> - fragile by requiring child views to have lineage field
>>>
>>> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>>>
>>> - update of materialized view version required if child view is updated
>>> (#5)
>>> On 16.08.24 09:17, Jan Kaul wrote:
>>>
>>> Hi,
>>>
>>> Thanks Micah for clearly stating the requirements. I think this gives
>>> better clarity for the discussion.
>>>
>>> It seems like we don't have a solution that satisfies all requirements
>>> at once. So we would need to choose which has the fewest drawbacks.
>>>
>>> I would like to summarize the different drawbacks that came up in the
>>> discussion.
>>> no lineage
>>> + refresh-state key = identifier
>>> normalized lineage
>>> + refresh-state key = uuid
>>> denormalized lineage
>>> + refresh-state key = uuid
>>> - introduces catalog identifiers into table metadata (#4)
>>> - query engine has to expand lineage at refresh time (happens anyway)
>>> - recursive calls to catalog to expand lineage at read time (#2)
>>> - fragile by requiring child views to have lineage field
>>> - update of materialized view version required if child view is updated
>>> (#5)
>>>
>>> With identifiers as the refresh-state keys, the lineage is not strictly
>>> required and becomes an orthogonal proposal. That's why I left it out of
>>> the comparison.
>>>
>>> In my opinion introducing catalog identifiers into the table metadata
>>> (requirement #4) is the least significant drawback as it is not a 

Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Sung Yun
Hi folks!

We are 1 binding vote short of accepting this release candidate. The
verification steps are very easy to follow and can be found here:
https://py.iceberg.apache.org/verify-release/

Thank you all again for testing and verifying the release!

Sung

On Fri, Aug 16, 2024 at 12:39 PM Chinmay Bhat  wrote:

> +1 (non-binding)
>
> - Verified signatures, checksums, license
> - Ran unit tests & test-coverage with Python 3.9.19
>
> Best,
> Chinmay
>
> On Fri, Aug 16, 2024 at 10:02 PM Daniel Weeks  wrote:
>
>> Thanks Sung!
>>
>> I agree with the comments that this doesn't require a new RC.
>>
>> +1 (binding)
>>
>> Verified sigs/sums/license/build/test with Python 3.11.9
>>
>> Thanks,
>> -Dan
>>
>> On Thu, Aug 15, 2024 at 3:34 PM Sung Yun  wrote:
>>
>>> Hi Daniel, thank you very much for testing the installation thoroughly
>>> and reporting these issues.
>>>
>>> We make note of the supported Python versions using the PyPI classifiers,
>>> but I agree that we should include the restriction in the installation
>>> requirements as well.
>>>
>>> Here's a PR with an attempt to fix this issue, and the issue with
>>> docutils (0.21.post1):
>>> https://github.com/apache/iceberg-python/pull/1067
>>>
>>> On Thu, Aug 15, 2024 at 5:35 PM Daniel Weeks  wrote:
>>>
 I ran into a couple issues while trying to verify the release.

 The first appears to be a transient issue (we ran into something
 similar in the 0.6.1 release but I was able to install later).

 Package docutils (0.21.post1) not found.
 make: *** [install-dependencies] Error 1

 The second issue is more concerning to me because I can't install
 dependencies with Python 3.12.4 as I get the following:

   - Installing numpy (1.24.4): Failed

   ChefBuildError

   Backend 'setuptools.build_meta:__legacy__' is not available.

   Cannot import 'setuptools.build_meta'

   at venv/lib/python3.12/site-packages/poetry/installation/chef.py:164
 in _prepare
   160│
   161│ error =
 ChefBuildError("\n\n".join(message_parts))
   162│
   163│ if error is not None:
 → 164│ raise error from None
   165│
   166│ return path
   167│
   168│ def _prepare_sdist(self, archive: Path, destination:
 Path | None = None) -> Path:

 Note: This error originates from the build backend, and is likely not a
 problem with poetry but with numpy (1.24.4) not supporting PEP 517 builds.
 You can verify this by running 'pip wheel --no-cache-dir --use-pep517
 "numpy (==1.24.4)"'.

 I was able to verify everything with 3.11; however, I haven't seen
 anything that would indicate we don't support 3.12.x.

 -Dan

 On Wed, Aug 14, 2024 at 7:14 PM Kevin Liu 
 wrote:

> +1 (non-binding)
> Verified signatures/checksums/licenses. Ran unit and integration tests.
>
> On Thu, Aug 15, 2024 at 2:42 AM Fokko Driesprong 
> wrote:
>
>> +1 (binding)
>>
>> Thanks Sung for running this 🙌
>>
>> - Validated signatures/checksums/license
>> - Ran some basic tests (3.10)
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, 14 Aug 2024 at 19:57, André Luis Anastácio wrote:
>>
>>>
>>>- validated signatures and checksums
>>>
>>>
>>>- checked license
>>>
>>>
>>>- ran tests and test-coverage with Python 3.9.12
>>>
>>>
>>> +1 (non-binding)
>>>
>>> André Anastácio
>>>
>>> On Tuesday, August 13th, 2024 at 10:19 PM, Sung Yun <
>>> sungwy...@gmail.com> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I propose that we release the following RC as the official PyIceberg
>>> 0.7.1 release.
>>>
>>> A summary of the high level features:
>>>
>>> * Fix `delete` to trace existing manifests when a data file is
>>> partially rewritten
>>> 
>>> * Fix 'to_arrow_batch_reader' to respect the limit input arg
>>> 
>>> * Fix correctness of applying positional deletes on Merge-On-Read
>>> tables 
>>> * Fix overwrite when filtering data
>>> 
>>> * Bug fix for deletes across multiple partition specs on partition
>>> evolution 
>>> * Fix evolving the table and writing in the same transaction
>>> 
>>> * Fix scans when result is empty
>>> 
>>> * Fix ListNamespace response in REST C

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Micah Kornfield
>
> That being said, I think the most important consideration for now is where
> are the current maintainers / contributors to the variant type. If most of
> them are already PMC members / committers on a project, it becomes a bit
> easier. Otherwise if there isn't much overlap with a project's existing
> governance, I worry there could be a bit of friction. How many active
> contributors are there from Iceberg? And how about from Arrow?


I think this is the key question. What are the requirements around
governance?  I've seen some tangential messaging here but I'm not clear on
what everyone expects.

I think for a lot of the other concerns my view is that the exact project
does not really matter (and choosing a project with mature cross-language
testing infrastructure, or committing to building it, is critical). IIUC we
are talking about the following artifacts:

1.  A stand alone specification document (this can be hosted anyplace)
2.  A set of language bindings with minimal dependencies that can be consumed
downstream (again, as long as dependencies are managed carefully, any
project can host these)
3.  Potential integration where appropriate into file format libraries to
support shredding (but as of now this is being bypassed by using
conventions anyways).  My impression is that at least for Parquet there has
been a proliferation of vectorized readers across different projects, so
I'm not clear how much standardization in parquet-java could help here.

To respond to some other questions:

> Arrow is not used as Spark's in-memory model, nor Trino's and others', so those
> existing relationships aren't there. I also worry that differences in
> approaches would make it difficult later on.


While Arrow is not in the core memory model, for Spark I believe it is
still used for IPC for things like Java<->Python. Trino also consumes Arrow
libraries today to support things like Snowflake/Bigquery federation. But I
think this is minor because as mentioned above I think the functional
libraries would be relatively stand-alone.

Do we think it could be introduced as a canonical Arrow extension type?


I believe it can be. I think there are probably different layouts that can
be supported:

1.  A struct with two variable-width binary columns (metadata and value data
are stored separately and each entry has a 1:1 relationship).
2.  Shredded (according to the same convention as Parquet). I would need to
double-check, but I don't think Arrow would have problems here, though REE
would likely be required to make this efficient (i.e. sparse value support
is important).

In both cases the main complexity is providing the necessary functions for
manipulation.
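
To make layout (1) a bit more concrete, here is a minimal sketch using the arrow-vector Java library; the field names are assumptions for illustration, and the registration and serialized metadata a real canonical extension type would need are omitted:

    // Sketch of layout (1): struct<metadata: binary, value: binary>.
    // Assumes org.apache.arrow:arrow-vector on the classpath.
    import java.util.List;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class VariantLayoutSketch {
        public static void main(String[] args) {
            Field metadata = new Field("metadata",
                FieldType.notNullable(ArrowType.Binary.INSTANCE), null);
            Field value = new Field("value",
                FieldType.notNullable(ArrowType.Binary.INSTANCE), null);
            Field variant = new Field("variant",
                FieldType.nullable(ArrowType.Struct.INSTANCE), List.of(metadata, value));
            System.out.println(new Schema(List.of(variant)));
        }
    }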

Thanks,
Micah







On Fri, Aug 16, 2024 at 3:58 PM Will Jones  wrote:

> In being more engine and format agnostic, I agree the Arrow project might
> be a good host for such a specification. It seems like we want to move away
> from hosting in Spark to make it engine agnostic. But moving into Iceberg
> might make it less format agnostic, as I understand multiple formats might
> want to implement this. I'm not intimately familiar with the state of this,
> but I believe Delta Lake would like to be aligned with the same format as
> Iceberg. In addition, the Lance format (which I work on) will eventually
> be interesting as well. It seems equally bad to me to attach this
> specification to a particular table format as to a particular query
> engine.
>
> That being said, I think the most important consideration for now is where
> are the current maintainers / contributors to the variant type. If most of
> them are already PMC members / committers on a project, it becomes a bit
> easier. Otherwise if there isn't much overlap with a project's existing
> governance, I worry there could be a bit of friction. How many active
> contributors are there from Iceberg? And how about from Arrow?
>
> BTW, I'd add I'm interested in helping develop an Arrow extension type for
> the binary variant type. I've been experimenting with a DataFusion
> extension that operates on this [1], and already have some ideas on how
> such an extension type might be defined. I'm not yet caught up on the
> shredded specification, but I think having just the binary format would be
> beneficial for in-memory analytics, which are most relevant to Arrow. I'll
> be creating a separate thread on the Arrow ML about this soon.
>
> Best,
>
> Will Jones
>
> [1]
> https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>
>
> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu  wrote:
>
> > + dev@arrow
> >
> > Thanks for all the valuable suggestions! I am inclined toward Micah's idea
> > that Arrow might be a better host compared to Parquet.
> >
> > To give more context, I am taking the initiative to add the geometry type
> > to both Parquet and ORC. I'd like to do the same thing for the variant type,
> > since the variant type is engine and file format agnostic. This does mean that
> > Parquet might not be the neutral pl

Re: [DISCUSS] Guidelines for committing PRs

2024-08-16 Thread Micah Kornfield
Hi Walaa,

> For the former, we could talk about avoiding conflict of interest as a way
> of "maintaining trust". For the latter, we can state some examples that
> clearly reflect conflict of interest with no ambiguity. For example, a
> committer merging a large change that received minimal discussion and
> review while working for the same employer as the PR author can be an
> indication of conflict of interest. On the other hand, an author and a
> committer merely working for the same employer does not necessarily
> constitute conflict of interest if proper process was followed and
> sufficient stakeholders were included in the discussion.


To avoid confusion, I've removed all discussion on conflicts of interest.


> For the last two paragraphs, I thought we could just clarify them a bit.
> All they need may be a bit of untangling to directly state what needs a
> discussion vs a vote vs an IIP. Current flow starts with the committer's
> opinion, then a list of exceptions that require some (or all?) of the
> above. Further, it was not clear if the exception is to committer's
> ability to merge or something else, since the exceptions are stated as
> "There are several exceptions to this process:", but it was not clear what
> process/which part of it. Hence, being more direct could simplify parsing
> these two paragraphs.


I have tried to remove the ambiguity on the "this process".  For anything
not addressed it would be helpful if you could make specific
recommendations on the PR.

Thanks,
Micah

On Mon, Aug 12, 2024 at 11:18 PM Walaa Eldin Moustafa 
wrote:

> I think the issue with the first paragraph is about:
>
> 1- The perceived contradiction between a) trusting committers to act in
> the best interest of the project and b) simultaneously providing specific
> guidelines on how to act (e.g., by avoiding conflicts of interest).
>
> 2- The specific examples given for the conflict of interest, which if
> applied generically, could lead to unnecessary avoidance of reviewing.
>
> For the former, we could talk about avoiding conflict of interest as a way
> of "maintaining trust". For the latter, we can state some examples that
> clearly reflect conflict of interest with no ambiguity. For example, a
> committer merging a large change that received minimal discussion and
> review while working for the same employer as the PR author can be an
> indication of conflict of interest. On the other hand, an author and a
> committer merely working for the same employer does not necessarily
> constitute conflict of interest if proper process was followed and
> sufficient stakeholders were included in the discussion. How about
> something like this:
>
> Committers are entrusted to act in the best interest of the project, and
> part of maintaining this trust involves managing potential conflicts of
> interest. While there is no exhaustive definition of what constitutes a
> conflict of interest, it is important to recognize situations that could
> lead to that perception. For instance, a conflict of interest might be more
> evident if a committer approves and merges a substantial change that has
> received minimal discussion or review, particularly when there is a close
> professional relationship between the committer and the author. This
> scenario could be interpreted as providing preferential treatment, which
> can undermine the integrity of the project. However, merely working for the
> same employer as the author does not, by itself, create a conflict of
> interest—especially when proper processes are followed, and a sufficient
> number of stakeholders are involved in the discussion and review. The focus
> should be on ensuring transparency and broad input to maintain the trust
> essential for the project's success.
>
> For the last two paragraphs, I thought we could just clarify them a bit.
> All they need may be a bit of untangling to directly state what needs a
> discussion vs a vote vs an IIP. Current flow starts with the committer's
> opinion, then a list of exceptions that require some (or all?) of the
> above. Further, it was not clear if the exception is to committer's ability
> to merge or something else, since the exceptions are stated as "There are
> several exceptions to this process:", but it was not clear what
> process/which part of it. Hence, being more direct could simplify parsing
> these two paragraphs.
>
> Thanks,
> Walaa.
>


Re: [DISCUSS] Guidelines for committing PRs

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Micah. It is clearer now. I have left some comments. Let us continue
on the PR.

On Fri, Aug 16, 2024 at 5:39 PM Micah Kornfield 
wrote:

> Hi Walaa,
>
>> For the former, we could talk about avoiding conflict of interest as a
>> way of "maintaining trust". For the latter, we can state some examples that
>> clearly reflect conflict of interest with no ambiguity. For example, a
>> committer merging a large change that received minimal discussion and
>> review while working for the same employer as the PR author can be an
>> indication of conflict of interest. On the other hand, an author and a
>> committer merely working for the same employer does not necessarily
>> constitute conflict of interest if proper process was followed and
>> sufficient stakeholders were included in the discussion.
>
>
> To avoid confusion, I've removed all discussion on conflicts of interest.
>
>
>> For the last two paragraphs, I thought we could just clarify them a bit.
>> All they need may be a bit of untangling to directly state what needs a
>> discussion vs a vote vs an IIP. Current flow starts with the committer's
>> opinion, then a list of exceptions that require some (or all?) of the
>> above. Further, it was not clear if the exception is to committer's
>> ability to merge or something else, since the exceptions are stated as
>> "There are several exceptions to this process:", but it was not clear what
>> process/which part of it. Hence, being more direct could simplify parsing
>> these two paragraphs.
>
>
> I have tried to remove the ambiguity on the "this process".  For anything
> not addressed it would be helpful if you could make specific
> recommendations on the PR.
>
> Thanks,
> Micah
>
> On Mon, Aug 12, 2024 at 11:18 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> I think the issue with the first paragraph is about:
>>
>> 1- The perceived contradiction between a) trusting committers to act in
>> the best interest of the project and b) simultaneously providing specific
>> guidelines on how to act (e.g., by avoiding conflicts of interest).
>>
>> 2- The specific examples given for the conflict of interest, which if
>> applied generically, could lead to unnecessary avoidance of reviewing.
>>
>> For the former, we could talk about avoiding conflict of interest as a
>> way of "maintaining trust". For the latter, we can state some examples that
>> clearly reflect conflict of interest with no ambiguity. For example, a
>> committer merging a large change that received minimal discussion and
>> review while working for the same employer as the PR author can be an
>> indication of conflict of interest. On the other hand, an author and a
>> committer merely working for the same employer does not necessarily
>> constitute conflict of interest if proper process was followed and
>> sufficient stakeholders were included in the discussion. How about
>> something like this:
>>
>> Committers are entrusted to act in the best interest of the project, and
>> part of maintaining this trust involves managing potential conflicts of
>> interest. While there is no exhaustive definition of what constitutes a
>> conflict of interest, it is important to recognize situations that could
>> lead to that perception. For instance, a conflict of interest might be more
>> evident if a committer approves and merges a substantial change that has
>> received minimal discussion or review, particularly when there is a close
>> professional relationship between the committer and the author. This
>> scenario could be interpreted as providing preferential treatment, which
>> can undermine the integrity of the project. However, merely working for the
>> same employer as the author does not, by itself, create a conflict of
>> interest—especially when proper processes are followed, and a sufficient
>> number of stakeholders are involved in the discussion and review. The focus
>> should be on ensuring transparency and broad input to maintain the trust
>> essential for the project's success.
>>
>> For the last two paragraphs, I thought we could just clarify them a bit.
>> All they need may be a bit of untangling to directly state what needs a
>> discussion vs a vote vs an IIP. Current flow starts with the committer's
>> opinion, then a list of exceptions that require some (or all?) of the
>> above. Further, it was not clear if the exception is to committer's ability
>> to merge or something else, since the exceptions are stated as "There are
>> several exceptions to this process:", but it was not clear what
>> process/which part of it. Hence, being more direct could simplify parsing
>> these two paragraphs.
>>
>> Thanks,
>> Walaa.
>>
>


Re: [VOTE] Release Apache PyIceberg 0.7.1rc2

2024-08-16 Thread Honah J.
+1 (binding)

- Validated signatures/checksum/license
- Ran test with Python 3.11.9

Sorry for being late. Thanks Sung for running the release! Thanks everyone
for contributing and testing!

Best regards,
Honah

On Fri, Aug 16, 2024 at 5:04 PM Sung Yun  wrote:

> Hi folks!
>
> We are 1 binding vote short of accepting this release candidate. The
> verification steps are very easy to follow and can be found here:
> https://py.iceberg.apache.org/verify-release/
>
> Thank you all again for testing and verifying the release!
>
> Sung
>
> On Fri, Aug 16, 2024 at 12:39 PM Chinmay Bhat  wrote:
>
>> +1 (non-binding)
>>
>> - Verified signatures, checksums, license
>> - Ran unit tests & test-coverage with Python 3.9.19
>>
>> Best,
>> Chinmay
>>
>> On Fri, Aug 16, 2024 at 10:02 PM Daniel Weeks  wrote:
>>
>>> Thanks Sung!
>>>
>>> I agree with the comments that this doesn't require a new RC.
>>>
>>> +1 (binding)
>>>
>>> Verified sigs/sums/license/build/test with Python 3.11.9
>>>
>>> Thanks,
>>> -Dan
>>>
>>> On Thu, Aug 15, 2024 at 3:34 PM Sung Yun  wrote:
>>>
 Hi Daniel, thank you very much for testing the installation thoroughly
 and reporting these issues.

 We make note of the supported Python versions using the PyPI classifiers,
 but I agree that we should include the restriction in the installation
 requirements as well.

 Here's a PR with an attempt to fix this issue, and the issue with
 docutils (0.21.post1):
 https://github.com/apache/iceberg-python/pull/1067

 On Thu, Aug 15, 2024 at 5:35 PM Daniel Weeks  wrote:

> I ran into a couple issues while trying to verify the release.
>
> The first appears to be a transient issue (we ran into something
> similar in the 0.6.1 release but I was able to install later).
>
> Package docutils (0.21.post1) not found.
> make: *** [install-dependencies] Error 1
>
> The second issue is more concerning to me because I can't install
> dependencies with Python 3.12.4 as I get the following:
>
>   - Installing numpy (1.24.4): Failed
>
>   ChefBuildError
>
>   Backend 'setuptools.build_meta:__legacy__' is not available.
>
>   Cannot import 'setuptools.build_meta'
>
>   at venv/lib/python3.12/site-packages/poetry/installation/chef.py:164
> in _prepare
>   160│
>   161│ error =
> ChefBuildError("\n\n".join(message_parts))
>   162│
>   163│ if error is not None:
> → 164│ raise error from None
>   165│
>   166│ return path
>   167│
>   168│ def _prepare_sdist(self, archive: Path, destination:
> Path | None = None) -> Path:
>
> Note: This error originates from the build backend, and is likely not
> a problem with poetry but with numpy (1.24.4) not supporting PEP 517
> builds. You can verify this by running 'pip wheel --no-cache-dir
> --use-pep517 "numpy (==1.24.4)"'.
>
> I was able to verify everything with 3.11; however, I haven't seen
> anything that would indicate we don't support 3.12.x.
>
> -Dan
>
> On Wed, Aug 14, 2024 at 7:14 PM Kevin Liu 
> wrote:
>
>> +1 (non-binding)
>> Verified signatures/checksums/licenses. Ran unit and integration
>> tests.
>>
>> On Thu, Aug 15, 2024 at 2:42 AM Fokko Driesprong 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Thanks Sung for running this 🙌
>>>
>>> - Validated signatures/checksums/license
>>> - Ran some basic tests (3.10)
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, 14 Aug 2024 at 19:57, André Luis Anastácio wrote:
>>>

- validated signatures and checksums


- checked license


- ran tests and test-coverage with Python 3.9.12


 +1 (non-binding)

 André Anastácio

 On Tuesday, August 13th, 2024 at 10:19 PM, Sung Yun <
 sungwy...@gmail.com> wrote:

 Hi Everyone,

 I propose that we release the following RC as the official
 PyIceberg 0.7.1 release.

 A summary of the high level features:

 * Fix `delete` to trace existing manifests when a data file is
 partially rewritten
 
 * Fix 'to_arrow_batch_reader' to respect the limit input arg
 
 * Fix correctness of applying positional deletes on Merge-On-Read
 tables 
 * Fix overwrite when filtering data
 

Re: [VOTE] Release Apache Iceberg Rust 0.3.0 RC1

2024-08-16 Thread Renjie Liu
+1 (binding)

[*] Download links are valid.
[*] Checksums and signatures. It seems we are missing a how-to-verify page, so I followed the
instructions here:
https://iceberg.apache.org/how-to-release/#validating-a-source-release-candidate
[*] LICENSE/NOTICE files exist
[*] No unexpected binary files
[ ] All source files have ASF headers. I didn't check manually, but we have
CI working, so I skipped it.
[*] Can compile from source: I ran `make test` and it works on macOS. We
have CI working on Linux, so I didn't run it on Linux manually.

On Fri, Aug 16, 2024 at 2:07 PM Christian Thiel
 wrote:

> +1 (non-binding)
>
>
>
> *From: *Xuanwo 
> *Date: *Wednesday, 14. August 2024 at 17:58
> *To: *dev@iceberg.apache.org 
> *Subject: *[VOTE] Release Apache Iceberg Rust 0.3.0 RC1
>
> Hello, Apache Iceberg Rust Community,
>
> This is a call for a vote to release Apache Iceberg rust version 0.3.0.
>
> The tag to be voted on is 0.3.0.
>
> The release candidate:
>
> https://dist.apache.org/repos/dist/dev/iceberg/iceberg-rust-0.3.0-rc.1/
>
> Keys to verify the release candidate:
>
> https://downloads.apache.org/iceberg/KEYS
>
> Git tag for the release:
>
> https://github.com/apache/iceberg-rust/releases/tag/v0.3.0-rc.1
>
> Please download, verify, and test.
>
> The VOTE will be open for at least 72 hours and until the necessary
> number of votes are reached.
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> To learn more about Apache Iceberg Rust, please see
> https://rust.iceberg.apache.org/
>
> Checklist for reference:
>
> [ ] Download links are valid.
> [ ] Checksums and signatures.
> [ ] LICENSE/NOTICE files exist
> [ ] No unexpected binary files
> [ ] All source files have ASF headers
> [ ] Can compile from source
>
> More detailed checklist please refer to:
> https://github.com/apache/iceberg-rust/tree/main/scripts
>
> To compile from source, please refer to:
> https://github.com/apache/iceberg-rust/blob/main/CONTRIBUTING.md
>
> Here is a Python script in release to help you verify the release
> candidate:
>
> ./scripts/verify.py
>
> Thanks
>
> Xuanwo
>
> https://xuanwo.io/
>


Re: [DISCUSS] Row Lineage Proposal

2024-08-16 Thread Péter Váry
Hi Russell,

As discussed offline, this would be very hard to implement with the current
Flink CDC write strategies, and I think this is true for every streaming
writer.

To track the previous version of a row, the streaming writer would need to
scan the table, and it would need to do so for every record to find the
previous version. This could work if the data were stored in a way that
supports fast queries on the primary key, like an LSM tree (see: Paimon [1]);
otherwise it would be prohibitively costly and unfeasible for higher loads.
So adding a new storage strategy could be one solution.

Alternatively, we might find a way for compaction to update the lineage
fields. We could provide a way to link the equality deletes to the new rows
which updated them during the write; then, on compaction, we could update the
lineage fields based on this info.
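
As a very rough sketch of that linking idea, assuming invented names and types throughout (this is not Iceberg or Flink API), the writer side could look something like:

    // Invented sketch only (not Iceberg or Flink API): pair each equality delete
    // with the id of the row that replaced it, so a later compaction could carry
    // the old row's lineage fields over to the new row.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    public class UpsertLineageSketch {
        record WrittenRow(String rowId, long lastUpdatedMarker, String key, String payload) {}

        // key of the deleted (old) row -> rowId of the row that replaced it in this commit
        private final Map<String, String> deleteToReplacement = new HashMap<>();

        WrittenRow upsert(String key, String payload, long commitSequence) {
            WrittenRow row = new WrittenRow(UUID.randomUUID().toString(), commitSequence, key, payload);
            // An equality delete for `key` would be emitted here (elided); remember
            // which new row caused it so compaction can link the two later.
            deleteToReplacement.put(key, row.rowId());
            return row;
        }

        public static void main(String[] args) {
            UpsertLineageSketch writer = new UpsertLineageSketch();
            WrittenRow row = writer.upsert("order-42", "{\"status\":\"shipped\"}", 7L);
            System.out.println("delete(order-42) linked to new row " + row.rowId());
        }
    }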

Are there any better ideas from Spark streaming which we could adopt?

Thanks,
Peter

[1] - https://paimon.apache.org/docs/0.8/

On Sat, Aug 17, 2024, 01:06 Russell Spitzer 
wrote:

> Hi Y'all,
>
> We've been working on a new proposal to add Row Lineage to Iceberg in the
> V3 Spec. The general idea is to give every row a unique identifier as well
> as a marker of what version of the row it is. This should let us build a
> variety of features related to CDC, Incremental Processing and Audit
> Logging. If you are interested, please check out the linked proposal below.
> This will require compliance from all engines to be really useful, so it's
> important we come to consensus on whether or not this is possible.
>
>
> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>
>
> Thank you for your consideration,
> Russ
>