Re: [Discuss] Iceberg community maintaining the docker images

2024-10-10 Thread Ajantha Bhat
Thanks for the discussions.

I will just focus on the Docker image of the REST catalog TCK first.

These are the related PRs:
https://github.com/apache/iceberg/pull/11279
https://github.com/apache/iceberg/pull/11283

We still need Apache Infra's help to publish the image under the Apache
Docker Hub account.
I hope one of the PMC members can help me with this.

Image name can be "apache/iceberg-rest-adapter".
- Ajantha



On Fri, Oct 11, 2024 at 2:17 AM rdb...@gmail.com  wrote:

> I was specifically replying to this suggestion to add docker images for
> Trino and Spark:
>
> > I also envision the Iceberg community maintaining some quick-start
> Docker images, such as spark-iceberg-rest, Trino-iceberg-rest, among others.
>
> It sounds like we're mostly agreed that the Iceberg project itself isn't a
> good place to do that. As for an image that is for catalog implementations
> to test against, I think that's a good idea (supporting testing and
> validation).
>
> On Thu, Oct 10, 2024 at 10:56 AM Jean-Baptiste Onofré 
> wrote:
>
>> That's actually what I meant by a REST Catalog docker image for testing.
>>
>> Personally, I would not include any docker images in the Iceberg project
>> (but more in the "iceberg" ecosystem, which is different from the project
>> :)).
>>
>> However, if the community has a different view on that, no problem.
>>
>> Regards
>> JB
>>
>> On Thu, Oct 10, 2024 at 9:50 AM Daniel Weeks  wrote:
>>
>>> I think we should focus on the docker image for the test REST Catalog
>>> implementation.  This is somewhat different from the TCK since it's used by
>>> the python/rust/go projects for testing the client side of the REST
>>> specification.
>>>
>>> As for the quickstart/example type images, I'm open to discussing what
>>> makes sense here, but we should decouple that and other docker images from
>>> getting a test REST catalog image out.  (Seems like there's general
>>> consensus around that).
>>>
>>> -Dan
>>>
>>> On Thu, Oct 10, 2024 at 4:29 AM Ajantha Bhat 
>>> wrote:
>>>
 Yes, the PRs I mentioned are about running the TCK as a docker container
 and keeping/maintaining that Dockerfile in the Iceberg repo.

 I envisioned maintaining other docker images as well because I am not sure
 about the roadmap of the ones in our quickstart (for example,
 tabulario/spark-iceberg).

 Thanks,
 Ajantha

 On Thu, Oct 10, 2024 at 3:50 PM Jean-Baptiste Onofré 
 wrote:

> Hi
>
> I think there's context missing here.
>
> I agree with Ryan that Iceberg should not provide any docker image or
> runtime things (we had the same discussion about REST server).
>
> However, my understanding is that this discussion is also related to
> the REST TCK. The TCK validation run needs a runtime, and I remember a
> discussion we had with Daniel (running TCK as a docker container).
>
> Regards
> JB
>
> On Wed, Oct 9, 2024 at 2:20 PM rdb...@gmail.com 
> wrote:
>
>> I think it's important for a project to remain focused on its core
>> purpose, and I've always advocated for Iceberg to remain a library that
>> is easy to plug into other projects. I think that should be the guide
>> here as well. Aren't projects like Spark and Trino responsible for
>> producing easy-to-use Docker images of those environments? Why would the
>> Iceberg project build and maintain them?
>>
>> I would prefer not to be distracted by these things, unless we need
>> them for cases like supporting testing and validation of things that are
>> part of the core purpose of the project.
>>
>> On Tue, Oct 8, 2024 at 6:08 AM Ajantha Bhat 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> Now that the test fixtures are in [1], we can create a runtime JAR
>>> for the REST catalog adapter [2] from the TCK.
>>> Following that, we can build and maintain the Docker image based on
>>> it [3].
>>>
>>> I also envision the Iceberg community maintaining some quick-start
>>> Docker images, such as spark-iceberg-rest and Trino-iceberg-rest, among
>>> others.
>>>
>>> I've looked into other Apache projects, and it seems that Apache
>>> Infra can assist us with this process, as we have the option to publish
>>> Iceberg Docker images under the Apache Docker Hub account.
>>>
>>> I am more than willing to maintain this code; please find the related
>>> PRs in [2] & [3].
>>>
>>> Any suggestions? Contributions are welcome if we agree to maintain it.
>>>
>>> [1] https://github.com/apache/iceberg/pull/10908
>>> [2] https://github.com/apache/iceberg/pull/11279
>>> [3] https://github.com/apache/iceberg/pull/11283
>>>
>>> - Ajantha
>>>
>>


Re: [DISCUSS] Iceberg Rust Sync Meeting

2024-10-10 Thread Christian Thiel
+1 for the Rust sync. Thanks for the proposal, Xuanwo. There are many open
topics, and alignment in the sync can help clarify scopes and dependencies
so we can move forward with iceberg-rust even faster.
The time works for me.

From: Kevin Liu 
Sent: Wednesday, October 9, 2024 4:47:57 PM
To: dev@iceberg.apache.org 
Subject: Re: [DISCUSS] Iceberg Rust Sync Meeting

+1 on sync meeting for iceberg rust. I want to get involved and catch up on the 
recent developments. For reference, here's the doc we've been using for the 
pyiceberg sync 
https://docs.google.com/document/d/1oMKodaZJrOJjPfc8PDVAoTdl02eGQKHlhwuggiw7s9U

Best,
Kevin

On Wed, Oct 9, 2024 at 5:30 AM Xuanwo wrote:
Hi,

I'm starting this thread to explore the idea of hosting an Iceberg Rust Sync 
Meeting. In this meeting, we will discuss recent major changes, pending PR 
reviews, and features in development. It will offer a space for Iceberg Rust 
contributors to connect and become familiar with each other, helping us 
identify and remove contribution barriers to the best of our ability.

Details about this meeting:

I suggest hosting our meeting at the same time of day, but one week earlier 
than the Iceberg Sync Meeting. For example, if the Iceberg Sync Meeting is 
scheduled for Thursday, October 24, 2024, from 00:00 to 01:00 GMT+8, the 
Iceberg Rust Sync Meeting would take place one week before, on Thursday, 
October 17, 2024, from 00:00 to 01:00 GMT+8.

I also suggest using the same Google Meet code (if possible) so we don't get 
confused.

These meetings will not be recorded, but I will take notes in a Google Doc, 
similar to what we do in the Iceberg Sync Meeting.

What are your thoughts? I'm open to other options as well.

Xuanwo

https://xuanwo.io/


Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Yufei Gu
+1
Yufei


On Thu, Oct 10, 2024 at 3:47 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

> +1, I've been reviewing this proposal/spec change for a bit and I think
> it's in a good state for the community to work on an implementation.
>
> Thanks Russell for driving this!
>
> On Thu, Oct 10, 2024 at 3:31 PM Jack Ye  wrote:
>
>> +1, overall agree that we should add this!
>>
>> Best,
>> Jack Ye
>>
>> On Thu, Oct 10, 2024 at 1:43 PM Daniel Weeks  wrote:
>>
>>> +1
>>>
>>> Thanks Russell!
>>>
>>> On Thu, Oct 10, 2024 at 6:57 AM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
 I left a few comments on the proposal, but I'm overall +1 on it.

 On Thu, Oct 10, 2024 at 12:08 PM Jean-Baptiste Onofré 
 wrote:

> +1
>
> I did a review on the proposal and it looks good to me.
>
> Regards
> JB
>
> On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
>  wrote:
> >
> > Hi Y'all!
> >
> > I think we are more or less in agreement on adding Row Lineage to
> the spec apart from a few details which may change a bit during
> implementation. Because of this, I'd like to call for an overall vote on
> whether or not Row-Lineage as described in  PR 11130 can be added to the
> spec.
> >
> > I'll note this is basically giving a thumbs up for reviewers and
> implementers to go ahead with the pull-request and acknowledging that you
> support the direction this proposal is going. I do think we'll probably
> dig a few things out when we write the reference implementation, but I
> think in general we have defined the required behaviors we want to see.
> >
> > Please vote in the next 72 hours
> >
> > [ ] +1, commit the proposed spec changes
> > [ ] -0
> > [ ] -1, do not make these changes because . . .
> >
> >
> > Thanks everyone,
> >
> > Russ
>



Re: Iceberg View Spec Improvements

2024-10-10 Thread Walaa Eldin Moustafa
Hi Dan,

I think there are a few questions that we should solve to decide the path
forward:

* Does the current spec contain implicit assumptions?
I think the answer is yes. I think this is also what Ryan indicated here
[1].

* Do these implicit assumptions make it difficult to adopt the spec or
evolve it in the correct way?
I think the answer is yes as well. MV design discussions became quite
complicated because most contributors had a different understanding of the
spec compared to what it encodes as implicit assumptions (see this thread
for an example [2] -- there are a few more). This unaligned understanding
could possibly lead to inaccurate designs and potentially result in
unneeded further constraints or unneeded engineering complexity.

* What are the implicit assumptions (in an ambiguous way)?
I do not think the answer is clear to everyone, even at this point. There
have been a few variations of those assumptions in this thread alone. I
think we should converge on a clear set of assumptions for everyone's
consumption.

* Should we add the assumptions explicitly to the spec?
I think we definitely should. Adoption or extension of the spec will be
quite difficult if the assumptions are not clearly stated and are
interpreted differently by different contributors.

Would be great to hear the community's feedback on whether they agree with
the answers to the above questions.

[1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
[2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91

Thanks,
Walaa.


Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Amogh Jahagirdar
+1, I've been reviewing this proposal/spec change for a bit and I think
it's in a good state for the community to work on an implementation.

Thanks Russell for driving this!

On Thu, Oct 10, 2024 at 3:31 PM Jack Ye  wrote:

> +1, overall agree that we should add this!
>
> Best,
> Jack Ye
>
> On Thu, Oct 10, 2024 at 1:43 PM Daniel Weeks  wrote:
>
>> +1
>>
>> Thanks Russell!
>>
>> On Thu, Oct 10, 2024 at 6:57 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> I left a few comments on the proposal, but I'm overall +1 on it.
>>>
>>> On Thu, Oct 10, 2024 at 12:08 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1

 I did a review on the proposal and it looks good to me.

 Regards
 JB

 On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
  wrote:
 >
 > Hi Y'all!
 >
 > I think we are more or less in agreement on adding Row Lineage to the
 spec apart from a few details which may change a bit during implementation.
 Because of this, I'd like to call for an overall vote on whether or not
 Row-Lineage as described in  PR 11130 can be added to the spec.
 >
 > I'll note this is basically giving a thumbs up for reviewers and
 implementers to go ahead with the pull-request and acknowledging that you
 support the direction this proposal is going. I do think we'll probably dig
 a few things out when we write the reference implementation, but I think in
 general we have defined the required behaviors we want to see.
 >
 > Please vote in the next 72 hours
 >
 > [ ] +1, commit the proposed spec changes
 > [ ] -0
 > [ ] -1, do not make these changes because . . .
 >
 >
 > Thanks everyone,
 >
 > Russ

>>>


Re: [PROPOSAL] Partially Loading Metadata - LoadTable V2

2024-10-10 Thread Haizhou Zhao
Thanks Eduard and Dan,

At this stage, my main goal is to check with the community whether this
problem is worth solving. If I can get sufficient feedback, or better yet
consensus, from the community, that lays a good foundation for progressing
this thread further. Implementation details are important but, at this
stage, less important than knowing this is the right direction. I look
forward to hashing out implementation details with folks from the community
should there be enough support for solving this problem.

That being said, if I have to throw out my two cents on the implementation
details now, then here it is:

@Eduard, I think the fundamental difference between us is that yours is
"LazilyLoading", while mine is "PartiallyLoading". Having reasoned about it,
"LazilyLoading" actually introduces less intrusive changes to fundamental
contracts like "TableOperations" and "Table", yet still achieves what
"PartiallyLoading" aims to do - i.e., on the surface the full metadata is
still there, but any field of the metadata can be lazily loaded, meaning it
is physically absent until it is needed, which in effect partially loads
the metadata. So yes, that makes a lot of sense. If I misunderstood your
implementation, let me know - I just downloaded your code patch and
started to play around with it.

@Dan, expanding the "refs" concept to more fields sounds great, but I worry
that eventually we will need to change the current "refs" implementation at
some level. Ideally, we aim for a reusable/generic framework for all these
kinds of list/map fields on metadata - we know that "snapshots",
"metadata-log", "snapshot-log", and "schemas" are growth factors in the
current version of the spec, but we might have more such fields in future
versions (hard to predict; probably "lineage"?). And I think there will be
changes when we convert a solution only applicable to the "snapshots" field
into a generic solution, which could be a spec change or a client change
(though it might take time to finalize the details).

@Eduard, @Dan feel free to comment. Welcome thoughts from the rest of the
community as well.

-Haizhou
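
For readers following the refs-only loading discussed in this thread: if I recall correctly, the REST spec's loadTable endpoint already accepts a `snapshots` query parameter (`all` or `refs`). A minimal sketch of how a client might build such a request — the class, method, and URL layout below are illustrative, not the actual Iceberg client:

```java
// Sketch (not the Iceberg REST client) of requesting a reduced metadata
// payload via the snapshot loading mode mentioned in this thread.
public class LoadTableRequestSketch {
  static String loadTableUrl(String baseUri, String namespace, String table, boolean refsOnly) {
    String url = String.format("%s/v1/namespaces/%s/tables/%s", baseUri, namespace, table);
    // With refs-only loading, the server may omit snapshots not referenced
    // by any branch or tag, shrinking the response for snapshot-heavy tables.
    return refsOnly ? url + "?snapshots=refs" : url + "?snapshots=all";
  }

  public static void main(String[] args) {
    System.out.println(loadTableUrl("https://catalog.example.com", "db", "events", true));
  }
}
```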

On Thu, Oct 10, 2024 at 9:23 AM Daniel Weeks  wrote:

> Hey Haizhou,
>
> I think you've done a great job of capturing some of the metadata size
> related issues in the doc, but I would echo Eduard's comments that we
> should explore using the existing refs only loading first.  This may
> require adding similar functionality for schemas/logs if we think that is a
> major issue (we have run into cases where that is an issue, but there's
> also maintenance work going on to help address some of these issues).
>
> The current refs only approach does fall back to a full metadata load when
> committing, but that was largely due to the complexity of changing the
> TableMetadata implementation, not necessarily a limitation of the REST spec.
>
> Definitely something we should be exploring, but we might already have
> some approaches that we can build upon.
>
> -Dan
>
> On Thu, Oct 10, 2024 at 6:37 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> Hey Haizhou,
>>
>> thanks for working on that proposal. I think my main concern with the
>> current proposal is that it adds quite a lot of complexity at a bunch of
>> places, since you'd need to partially update *TableMetadata*.
>> Additionally, it requires a new endpoint.
>>
>> An alternative to that would be to do something similar to what we
>> already have in *TableMetadata*, where we lazily load *snapshots* when
>> needed. We could expand that approach to lazily load the full
>> *TableMetadata* from the server when necessary and always only show a
>> slim version of *TableMetadata*. I did such a POC a while ago, which can
>> be seen in
>> https://github.com/nastra/iceberg/commit/ae2c7768c6f37be2f86b575bfc4fe84429b22a0e.
>> That POC would need to be expanded so that it doesn't only do this for
>> snapshots, but also for other fields.
>> I believe the main fields that can get quite large over time are *snapshots
>> / metadata-log / snapshot-log / schemas*.
>>
>> Might be worth checking how much we could gain by using a lazy table
>> metadata supplier in this scenario, as that would reduce the required
>> complexity.
>>
>> Thanks,
>> Eduard
>>
>>
>>
>> On Thu, Oct 10, 2024 at 2:05 AM Haizhou Zhao 
>> wrote:
>>
>>> Hello Dev List,
>>>
>>>
>>> I want to bring this proposal to discussion:
>>>
>>>
>>>
>>> https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit#heading=h.uad1lm906wz4
>>>
>>>
>>>
>>> It proposes a new LoadTable API (branded LoadTableV2 at the moment) on
>>> REST spec that allows partially loading table metadata. The motivation is
>>> to stabilize and optimize Spark write workloads, especially on Iceberg
>>> tables with big metadata (e.g. due to huge list of snapshot/metadata log,
>>> complicated schema, etc.). We want to leverage this proposal to reduce
>>> operational and monetary cost of Iceberg & REST catalog usages, and
>>> achieve higher commit frequencies (DDL & DML included) on top of Iceberg
>>> tables through REST catalog.

Re: Iceberg View Spec Improvements

2024-10-10 Thread Amogh Jahagirdar
I took another pass over the view spec, and I believe the representation of
identifiers and how resolution of references should be performed by engines
are clear. So from my perspective, at the moment we do not need to change
the view spec itself.

I do acknowledge though that practically there can be scenarios where
catalog names are inconsistent across environments and this has led to
confusion when developing the MV spec (I'm remembering based on last week's
community sync). There are some recommendations so that implementations can
address these inconsistencies in this thread already, but I don't think
adding some more complexity to the view spec via some form of
normalizing/mapping identifiers is worth it for these cases. I think in its
current state it's a sufficient model for developing MVs, and shouldn't
block progression on that.

I'm +1 on adding an "unsupported configurations" clarification though; it's
become clear to me that there's enough confusion around the implications of
the SQL identifiers in the spec that it's worth calling it out.

Thanks,

Amogh Jahagirdar

On Thu, Oct 10, 2024 at 5:08 PM Daniel Weeks  wrote:

> Russell,
>
> I think there are a few existing ways to support that.  For example, if
> you exclude the default catalog and fully reference the table with
> <catalog>.<namespace>.<table>, most sql engines will interpret that
> correctly (for cross or known catalogs).  Also, if you omit the catalog
> and use just <namespace>.<table>, it must use the catalog in which the
> view is defined (per the spec), which I think addresses your case.
>
> Server-side rewrite is possible, but I think we'd need to explore the
> specific cases, which we'll probably need to do as we consider secure views
> more closely.
>
> -Dan
>
> On Thu, Oct 10, 2024 at 3:59 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Hi Russel,
>>
>> Would this be a good candidate for a future version of the spec?
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Oct 10, 2024 at 3:50 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I still have an issue with representations not having explicit ways of
>>> incorporating the catalog name. I'm thinking about our potential future
>>> situation where we want to return a view for Fine Grained Access policies.
>>> In that case won't the Catalog need to craft a representation that matches
>>> the configuration of the engine? Doesn't this mean the client will have to
>>> tell the Catalog what its local name is?
>>>
>>> On Thu, Oct 10, 2024 at 5:34 PM Daniel Weeks  wrote:
>>>
 Hey Walaa,

 I recognize the issue you're calling out but disagree there is an
 implicit assumption in the spec.  The spec clearly says how identifiers
 including catalogs and namespaces are represented/stored and how references
 need to be resolved.  The idea that a catalog may not match is an
 environmental/infrastructure/configuration issue related to where they are
 being referenced from.

 If we think this is sufficiently confusing to people, I would be open
 to discussing an "unsupported configurations" callout, but I don't think
 this blocks work and am somewhat skeptical that it's necessary.

 -Dan



 On Thu, Oct 10, 2024 at 2:47 PM Walaa Eldin Moustafa <
 wa.moust...@gmail.com> wrote:

> Hi Dan,
>
> I think there are a few questions that we should solve to decide the
> path forward:
>
> * Does the current spec contain implicit assumptions?
> I think the answer is yes. I think this is also what Ryan indicated
> here [1].
>
> * Do these implicit assumptions make it difficult to adopt the spec
> or evolve it in the correct way?
> I think the answer is yes as well. MV design discussions became quite
> complicated because most contributors had a different understanding of the
> spec compared to what it encodes as implicit assumptions (see this thread
> for an example [2] -- there are a few more). This unaligned understanding
> could possibly lead to inaccurate designs and potentially result in
> unneeded further constraints or unneeded engineering complexity.
>
> * What are the implicit assumptions (in an ambiguous way)?
> I do not think the answer is clear to everyone, even at this point.
> There have been a few variations of those assumptions in this thread alone.
> I think we should converge on a clear set of assumptions for everyone's
> consumption.
>
> * Should we add the assumptions explicitly to the spec?
> I think we definitely should. Adoption or extension of the spec will
> be quite difficult if the assumptions are not clearly stated and are
> interpreted differently by different contributors.
>
> Would be great to hear the community's feedback on whether they agree
> with the answers to the above questions.
>
> [1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
> [2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91

Re: Iceberg View Spec Improvements

2024-10-10 Thread Daniel Weeks
Russell,

I think there are a few existing ways to support that.  For example, if you
exclude the default catalog and fully reference the table with
<catalog>.<namespace>.<table>, most sql engines will interpret that
correctly (for cross or known catalogs).  Also, if you omit the catalog and
use just <namespace>.<table>, it must use the catalog in which the view is
defined (per the spec), which I think addresses your case.

Server-side rewrite is possible, but I think we'd need to explore the
specific cases, which we'll probably need to do as we consider secure views
more closely.

-Dan
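
The resolution rule described above can be sketched as a small helper. This is an illustrative sketch only (the class and method names are hypothetical, not part of any Iceberg API), assuming an identifier is split into parts and that, per the view spec, an omitted catalog resolves to the catalog in which the view is defined:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of view reference resolution as discussed above.
public class ViewIdentifierSketch {
  static List<String> resolve(List<String> parts, String viewCatalog, String viewNamespace) {
    List<String> resolved = new ArrayList<>();
    if (parts.size() == 3) {
      resolved.addAll(parts);        // catalog.namespace.table: fully qualified, use as-is
    } else if (parts.size() == 2) {
      resolved.add(viewCatalog);     // catalog omitted: must use the view's own catalog
      resolved.addAll(parts);
    } else {
      resolved.add(viewCatalog);     // bare table name: view's catalog and namespace
      resolved.add(viewNamespace);
      resolved.addAll(parts);
    }
    return resolved;
  }

  public static void main(String[] args) {
    System.out.println(resolve(List.of("db", "t"), "prod", "default"));
  }
}
```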

On Thu, Oct 10, 2024 at 3:59 PM Walaa Eldin Moustafa 
wrote:

> Hi Russel,
>
> Would this be a good candidate for a future version of the spec?
>
> Thanks,
> Walaa.
>
>
> On Thu, Oct 10, 2024 at 3:50 PM Russell Spitzer 
> wrote:
>
>> I still have an issue with representations not having explicit ways of
>> incorporating the catalog name. I'm thinking about our potential future
>> situation where we want to return a view for Fine Grained Access policies.
>> In that case won't the Catalog need to craft a representation that matches
>> the configuration of the engine? Doesn't this mean the client will have to
>> tell the Catalog what its local name is?
>>
>> On Thu, Oct 10, 2024 at 5:34 PM Daniel Weeks  wrote:
>>
>>> Hey Walaa,
>>>
>>> I recognize the issue you're calling out but disagree there is an
>>> implicit assumption in the spec.  The spec clearly says how identifiers
>>> including catalogs and namespaces are represented/stored and how references
>>> need to be resolved.  The idea that a catalog may not match is an
>>> environmental/infrastructure/configuration issue related to where they are
>>> being referenced from.
>>>
>>> If we think this is sufficiently confusing to people, I would be open to
>>> discussing an "unsupported configurations" callout, but I don't think this
>>> blocks work and am somewhat skeptical that it's necessary.
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Thu, Oct 10, 2024 at 2:47 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
 Hi Dan,

 I think there are a few questions that we should solve to decide the
 path forward:

 * Does the current spec contain implicit assumptions?
 I think the answer is yes. I think this is also what Ryan indicated
 here [1].

 * Do these implicit assumptions make it difficult to adopt the spec or
 evolve it in the correct way?
 I think the answer is yes as well. MV design discussions became quite
 complicated because most contributors had a different understanding of the
 spec compared to what it encodes as implicit assumptions (see this thread
 for an example [2] -- there are a few more). This unaligned understanding
 could possibly lead to inaccurate designs and potentially result in
 unneeded further constraints or unneeded engineering complexity.

 * What are the implicit assumptions (in an ambiguous way)?
 I do not think the answer is clear to everyone, even at this point.
 There have been a few variations of those assumptions in this thread alone.
 I think we should converge on a clear set of assumptions for everyone's
 consumption.

 * Should we add the assumptions explicitly to the spec?
 I think we definitely should. Adoption or extension of the spec will be
 quite difficult if the assumptions are not clearly stated and are
 interpreted differently by different contributors.

 Would be great to hear the community's feedback on whether they agree
 with the answers to the above questions.

 [1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
 [2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91

 Thanks,
 Walaa.

>>>


Re: [PROPOSAL] Partially Loading Metadata - LoadTable V2

2024-10-10 Thread Eduard Tudenhöfner
Hey Haizhou,

thanks for working on that proposal. I think my main concern with the
current proposal is that it adds quite a lot of complexity at a bunch of
places, since you'd need to partially update *TableMetadata*. Additionally,
it requires a new endpoint.

An alternative to that would be to do something similar to what we already
have in *TableMetadata*, where we lazily load *snapshots* when needed. We
could expand that approach to lazily load the full *TableMetadata* from the
server when necessary and always only show a slim version of *TableMetadata*.
I did such a POC a while ago, which can be seen in
https://github.com/nastra/iceberg/commit/ae2c7768c6f37be2f86b575bfc4fe84429b22a0e.
That POC would need to be expanded so that it doesn't only do this for
snapshots, but also for other fields.
I believe the main fields that can get quite large over time are *snapshots
/ metadata-log / snapshot-log / schemas*.

Might be worth checking how much we could gain by using a lazy table
metadata supplier in this scenario, as that would reduce the required
complexity.

Thanks,
Eduard
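
The lazy-loading idea described above can be sketched roughly like this — illustrative class and field names only, not Iceberg's actual TableMetadata: the slim metadata holds a supplier and only materializes the heavy snapshot list the first time it is asked for.

```java
import java.util.List;
import java.util.function.Supplier;

// Minimal sketch (hypothetical names) of lazily loading a heavy metadata
// field on first access, e.g. via a second REST call back to the catalog.
public class LazyMetadataSketch {
  private final Supplier<List<String>> snapshotLoader;
  private List<String> snapshots; // null until first access

  LazyMetadataSketch(Supplier<List<String>> snapshotLoader) {
    this.snapshotLoader = snapshotLoader;
  }

  synchronized List<String> snapshots() {
    if (snapshots == null) {
      snapshots = snapshotLoader.get(); // fetched on demand, then cached
    }
    return snapshots;
  }

  public static void main(String[] args) {
    LazyMetadataSketch m = new LazyMetadataSketch(() -> List.of("snap-1", "snap-2"));
    System.out.println(m.snapshots().size());
  }
}
```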



On Thu, Oct 10, 2024 at 2:05 AM Haizhou Zhao 
wrote:

> Hello Dev List,
>
>
> I want to bring this proposal to discussion:
>
>
>
> https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit#heading=h.uad1lm906wz4
>
>
>
> It proposes a new LoadTable API (branded LoadTableV2 at the moment) on
> REST spec that allows partially loading table metadata. The motivation is
> to stabilize and optimize Spark write workloads, especially on Iceberg
> tables with big metadata (e.g. due to huge list of snapshot/metadata log,
> complicated schema, etc.). We want to leverage this proposal to reduce
> operational and monetary cost of Iceberg & REST catalog usages, and achieve
> higher commit frequencies (DDL & DML included) on top of Iceberg tables
> through REST catalog.
>
>
>
> Looking forward to hearing feedback and discussions.
>
>
> Thank you,
>
> Haizhou
>


Re: [Discuss] Apache Iceberg 1.6.2 release because of Avro CVE ?

2024-10-10 Thread Ajantha Bhat
If it has already been analyzed and is not really applicable to Iceberg,
we can wait for 1.7.0.

Thanks.
- Ajantha

On Thu, Oct 10, 2024 at 3:41 PM Jean-Baptiste Onofré 
wrote:

> Hi
>
> I did the security fix in Avro and I can say that Iceberg is not
> really impacted or vulnerable.
> I'm not against a 1.6.2 release, but as we discussed releasing Iceberg
> 1.7.0 by the end of October (see Russell's message a few days ago),
> maybe we can wait for 1.7.0?
>
> Regards
> JB
>
> On Wed, Oct 9, 2024 at 8:46 PM Ajantha Bhat  wrote:
> >
> > Hi everyone,
> > Since 1.7.0 is still a few weeks away,
> > how about releasing version 1.6.2 with just the Avro version update?
> > The current Avro version in 1.6.1 (1.11.3) has a recently reported CVE:
> CVE-2024-47561. [2]
> >
> > I'm happy to coordinate and be the release manager for this.
> >
> > [1]
> https://github.com/apache/iceberg/blob/8e9d59d299be42b0bca9461457cd1e95dbaad086/gradle/libs.versions.toml#L28
> > [2] https://lists.apache.org/thread/c2v7mhqnmq0jmbwxqq3r5jbj1xg43h5x
> >
> > - Ajantha
>


Re: [Discuss] Apache Iceberg 1.6.2 release because of Avro CVE ?

2024-10-10 Thread Jean-Baptiste Onofré
Hi

I did the security fix in Avro and I can say that Iceberg is not
really impacted or vulnerable.
I'm not against a 1.6.2 release, but as we discussed releasing Iceberg
1.7.0 by the end of October (see Russell's message a few days ago),
maybe we can wait for 1.7.0?

Regards
JB

On Wed, Oct 9, 2024 at 8:46 PM Ajantha Bhat  wrote:
>
> Hi everyone,
> Since 1.7.0 is still a few weeks away,
> how about releasing version 1.6.2 with just the Avro version update?
> The current Avro version in 1.6.1 (1.11.3) has a recently reported CVE: 
> CVE-2024-47561. [2]
>
> I'm happy to coordinate and be the release manager for this.
>
> [1] 
> https://github.com/apache/iceberg/blob/8e9d59d299be42b0bca9461457cd1e95dbaad086/gradle/libs.versions.toml#L28
> [2] https://lists.apache.org/thread/c2v7mhqnmq0jmbwxqq3r5jbj1xg43h5x
>
> - Ajantha


Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Eduard Tudenhöfner
I left a few comments on the proposal, but I'm overall +1 on it.

On Thu, Oct 10, 2024 at 12:08 PM Jean-Baptiste Onofré 
wrote:

> +1
>
> I did a review on the proposal and it looks good to me.
>
> Regards
> JB
>
> On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
>  wrote:
> >
> > Hi Y'all!
> >
> > I think we are more or less in agreement on adding Row Lineage to the
> spec apart from a few details which may change a bit during implementation.
> Because of this, I'd like to call for an overall vote on whether or not
> Row-Lineage as described in  PR 11130 can be added to the spec.
> >
> > I'll note this is basically giving a thumbs up for reviewers and
> implementers to go ahead with the pull-request and acknowledging that you
> support the direction this proposal is going. I do think we'll probably dig
> a few things out when we write the reference implementation, but I think in
> general we have defined the required behaviors we want to see.
> >
> > Please vote in the next 72 hours
> >
> > [ ] +1, commit the proposed spec changes
> > [ ] -0
> > [ ] -1, do not make these changes because . . .
> >
> >
> > Thanks everyone,
> >
> > Russ
>


Re: Iceberg View Spec Improvements

2024-10-10 Thread Daniel Weeks
Hey Walaa,

I recognize the issue you're calling out but disagree there is an implicit
assumption in the spec.  The spec clearly says how identifiers including
catalogs and namespaces are represented/stored and how references need to
be resolved.  The idea that a catalog may not match is an
environmental/infrastructure/configuration issue related to where they are
being referenced from.

If we think this is sufficiently confusing to people, I would be open to
discussing an "unsupported configurations" callout, but I don't think this
blocks work and am somewhat skeptical that it's necessary.

-Dan



On Thu, Oct 10, 2024 at 2:47 PM Walaa Eldin Moustafa 
wrote:

> Hi Dan,
>
> I think there are a few questions that we should solve to decide the path
> forward:
>
> * Does the current spec contain implicit assumptions?
> I think the answer is yes. I think this is also what Ryan indicated here
> [1].
>
> * Do these implicit assumptions make it difficult to adopt the spec or
> evolve it in the correct way?
> I think the answer is yes as well. MV design discussions became quite
> complicated because most contributors had a different understanding of the
> spec compared to what it encodes as implicit assumptions (see this thread
> for an example [2] -- there are a few more). This unaligned understanding
> could possibly lead to inaccurate designs and potentially result in
> unneeded further constraints or unneeded engineering complexity.
>
> ** What are the implicit assumptions (in an ambiguous way)?*
> I do not think the answer is clear to everyone, even at this point. There
> have been a few variations of those assumptions in this thread alone. I
> think we should converge on a clear set of assumptions for everyone's
> consumption.
>
> ** Should we add the assumptions explicitly to the spec?*
> I think we definitely should. Adoption or extension of the spec will be
> quite difficult if the assumptions are not clearly stated and are
> interpreted differently by different contributors.
>
> Would be great to hear the community's feedback on whether they agree with
> the answers to the above questions.
>
> [1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
> [2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91
>
> Thanks,
> Walaa.
>


Re: Iceberg View Spec Improvements

2024-10-10 Thread Russell Spitzer
I still have an issue with representations not having explicit ways of
incorporating the catalog name, I'm thinking about our potential future
situation where we want to return a view for Fine Grained Access policies.
In that case won't the Catalog need to craft a representation that matches
the configuration of the engine? Doesn't this mean the client will have to
tell the Catalog what its local name is?

On Thu, Oct 10, 2024 at 5:34 PM Daniel Weeks  wrote:

> Hey Walaa,
>
> I recognize the issue you're calling out but disagree there is an implicit
> assumption in the spec.  The spec clearly says how identifiers including
> catalogs and namespaces are represented/stored and how references need to
> be resolved.  The idea that a catalog may not match is an
> environmental/infrastructure/configuration issue related to where they are
> being referenced from.
>
> If we think this is sufficiently confusing to people, I would be open to
> discussing an "unsupported configurations" callout, but I don't think this
> blocks work and am somewhat skeptical that it's necessary.
>
> -Dan
>
>
>
> On Thu, Oct 10, 2024 at 2:47 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> I think there are a few questions that we should solve to decide the path
>> forward:
>>
>> ** Does the current spec contain implicit assumptions?*
>> I think the answer is yes. I think this is also what Ryan indicated here
>> [1].
>>
>> ** Do these implicit assumptions make it difficult to adopt the spec or
>> evolve it in the correct way?*
>> I think the answer is yes as well. MV design discussions became quite
>> complicated because most contributors had a different understanding of the
>> spec compared to what it encodes as implicit assumptions (see this thread
>> for an example [2] -- there are a few more). This unaligned understanding
>> could possibly lead to inaccurate designs and potentially result in
>> unneeded further constraints or unneeded engineering complexity.
>>
>> ** What are the implicit assumptions (in an ambiguous way)?*
>> I do not think the answer is clear to everyone, even at this point. There
>> have been a few variations of those assumptions in this thread alone. I
>> think we should converge on a clear set of assumptions for everyone's
>> consumption.
>>
>> ** Should we add the assumptions explicitly to the spec?*
>> I think we definitely should. Adoption or extension of the spec will be
>> quite difficult if the assumptions are not clearly stated and are
>> interpreted differently by different contributors.
>>
>> Would be great to hear the community's feedback on whether they agree
>> with the answers to the above questions.
>>
>> [1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
>> [2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91
>>
>> Thanks,
>> Walaa.
>>
>


Re: Iceberg View Spec Improvements

2024-10-10 Thread Walaa Eldin Moustafa
Thanks Dan. I am +1 for documenting unsupported configurations.

On Thu, Oct 10, 2024 at 3:34 PM Daniel Weeks  wrote:

> Hey Walaa,
>
> I recognize the issue you're calling out but disagree there is an implicit
> assumption in the spec.  The spec clearly says how identifiers including
> catalogs and namespaces are represented/stored and how references need to
> be resolved.  The idea that a catalog may not match is an
> environmental/infrastructure/configuration issue related to where they are
> being referenced from.
>
> If we think this is sufficiently confusing to people, I would be open to
> discussing an "unsupported configurations" callout, but I don't think this
> blocks work and am somewhat skeptical that it's necessary.
>
> -Dan
>
>
>
> On Thu, Oct 10, 2024 at 2:47 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> I think there are a few questions that we should solve to decide the path
>> forward:
>>
>> ** Does the current spec contain implicit assumptions?*
>> I think the answer is yes. I think this is also what Ryan indicated here
>> [1].
>>
>> ** Do these implicit assumptions make it difficult to adopt the spec or
>> evolve it in the correct way?*
>> I think the answer is yes as well. MV design discussions became quite
>> complicated because most contributors had a different understanding of the
>> spec compared to what it encodes as implicit assumptions (see this thread
>> for an example [2] -- there are a few more). This unaligned understanding
>> could possibly lead to inaccurate designs and potentially result in
>> unneeded further constraints or unneeded engineering complexity.
>>
>> ** What are the implicit assumptions (in an ambiguous way)?*
>> I do not think the answer is clear to everyone, even at this point. There
>> have been a few variations of those assumptions in this thread alone. I
>> think we should converge on a clear set of assumptions for everyone's
>> consumption.
>>
>> ** Should we add the assumptions explicitly to the spec?*
>> I think we definitely should. Adoption or extension of the spec will be
>> quite difficult if the assumptions are not clearly stated and are
>> interpreted differently by different contributors.
>>
>> Would be great to hear the community's feedback on whether they agree
>> with the answers to the above questions.
>>
>> [1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
>> [2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91
>>
>> Thanks,
>> Walaa.
>>
>


Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Steven Wu
+1

On Thu, Oct 10, 2024 at 2:52 PM Yufei Gu  wrote:

> +1
> Yufei
>
>
> On Thu, Oct 10, 2024 at 3:47 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> +1, I've been reviewing this proposal/spec change for a bit and I think
>> it's in a good state for the community to work on an implementation.
>>
>> Thanks Russell for driving this!
>>
>> On Thu, Oct 10, 2024 at 3:31 PM Jack Ye  wrote:
>>
>>> +1, overall agree that we should add this!
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Thu, Oct 10, 2024 at 1:43 PM Daniel Weeks  wrote:
>>>
 +1

 Thanks Russell!

 On Thu, Oct 10, 2024 at 6:57 AM Eduard Tudenhöfner <
 etudenhoef...@apache.org> wrote:

> I left a few comments on the proposal but I'm overall +1 on the
> proposal
>
> On Thu, Oct 10, 2024 at 12:08 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> I did a review on the proposal and it looks good to me.
>>
>> Regards
>> JB
>>
>> On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
>>  wrote:
>> >
>> > Hi Y'all!
>> >
>> > I think we are more or less in agreement on adding Row Lineage to
>> the spec apart from a few details which may change a bit during
>> implementation. Because of this, I'd like to call for an overall vote on
>> whether or not Row-Lineage as described in  PR 11130 can be added to the
>> spec.
>> >
>> > I'll note this is basically giving a thumbs up for reviewers and
>> implementers to go ahead with the pull-request and acknowledging that you
>> support the direction this proposal is going. I do think we'll probably 
>> dig
>> a few things out when we write the reference implementation, but I think 
>> in
>> general we have defined the required behaviors we want to see.
>> >
>> > Please vote in the next 72 hours
>> >
>> > [ ] +1, commit the proposed spec changes
>> > [ ] -0
>> > [ ] -1, do not make these changes because . . .
>> >
>> >
>> > Thanks everyone,
>> >
>> > Russ
>>
>


Re: Iceberg View Spec Improvements

2024-10-10 Thread Walaa Eldin Moustafa
Hi Russel,

Would this be a good candidate for a future version of the spec?

Thanks,
Walaa.


On Thu, Oct 10, 2024 at 3:50 PM Russell Spitzer 
wrote:

> I still have an issue with representations not having explicit ways of
> incorporating the catalog name, I'm thinking about our potential future
> situation where we want to return a view for Fine Grained Access policies.
> In that case won't the Catalog need to craft a representation that matches
> the configuration of the engine? Doesn't this mean the client will have to
> tell the Catalog what its local name is?
>
> On Thu, Oct 10, 2024 at 5:34 PM Daniel Weeks  wrote:
>
>> Hey Walaa,
>>
>> I recognize the issue you're calling out but disagree there is an
>> implicit assumption in the spec.  The spec clearly says how identifiers
>> including catalogs and namespaces are represented/stored and how references
>> need to be resolved.  The idea that a catalog may not match is an
>> environmental/infrastructure/configuration issue related to where they are
>> being referenced from.
>>
>> If we think this is sufficiently confusing to people, I would be open to
>> discussing an "unsupported configurations" callout, but I don't think this
>> blocks work and am somewhat skeptical that it's necessary.
>>
>> -Dan
>>
>>
>>
>> On Thu, Oct 10, 2024 at 2:47 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Dan,
>>>
>>> I think there are a few questions that we should solve to decide the
>>> path forward:
>>>
>>> ** Does the current spec contain implicit assumptions?*
>>> I think the answer is yes. I think this is also what Ryan indicated here
>>> [1].
>>>
>>> ** Do these implicit assumptions make it difficult to adopt the spec or
>>> evolve it in the correct way?*
>>> I think the answer is yes as well. MV design discussions became quite
>>> complicated because most contributors had a different understanding of the
>>> spec compared to what it encodes as implicit assumptions (see this thread
>>> for an example [2] -- there are a few more). This unaligned understanding
>>> could possibly lead to inaccurate designs and potentially result in
>>> unneeded further constraints or unneeded engineering complexity.
>>>
>>> ** What are the implicit assumptions (in an ambiguous way)?*
>>> I do not think the answer is clear to everyone, even at this point.
>>> There have been a few variations of those assumptions in this thread alone.
>>> I think we should converge on a clear set of assumptions for everyone's
>>> consumption.
>>>
>>> ** Should we add the assumptions explicitly to the spec?*
>>> I think we definitely should. Adoption or extension of the spec will be
>>> quite difficult if the assumptions are not clearly stated and are
>>> interpreted differently by different contributors.
>>>
>>> Would be great to hear the community's feedback on whether they agree
>>> with the answers to the above questions.
>>>
>>> [1] https://lists.apache.org/thread/s1hjnc163ny76smv2l0t2sxxn93s4595
>>> [2] https://lists.apache.org/thread/0wzowd15328rnwvotzcoo4jrdyrzlx91
>>>
>>> Thanks,
>>> Walaa.
>>>
>>


Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2024-10-10 Thread Jean-Baptiste Onofré
Hi

As we are talking about "documentation" (quick start/readme), I would
rather propose to use the REST catalog here instead of JDBC.

As it's the catalog we "promote", I think it would be valuable for
users to start with the "right thing".

The JDBC Catalog is interesting for a quick test/getting-started guide, but we know
how it goes: it will be heavily used (see what happened with the
HadoopCatalog, which was used in production even though it should not be :) ).

Regards
JB

On Tue, Oct 8, 2024 at 12:18 PM Kevin Liu  wrote:
>
> Hi all,
>
> I wanted to bring up a suggestion regarding our current documentation. The 
> existing examples for Iceberg often use the Hadoop catalog, as seen in:
>
> Adding a Catalog - Spark Quickstart [1]
> Adding Catalogs - Spark Getting Started [2]
>
> Since we generally advise against using Hadoop catalogs in production 
> environments, I believe it would be beneficial to replace these examples with 
> ones that use the JDBC catalog. The JDBC catalog, configured with a local 
> SQLite database file, offers similar convenience but aligns better with 
> production best practices.
>
> I've created an issue [3] and a PR [4] to address this. Please take a look, 
> and I'd love to hear your thoughts on whether this is a direction we want to 
> pursue.
>
> Best,
> Kevin Liu
>
> [1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
> [2] 
> https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
> [3] https://github.com/apache/iceberg/issues/11284
> [4] https://github.com/apache/iceberg/pull/11285
>
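As an aside for readers: the JDBC catalog setup Kevin describes can be wired into Spark roughly like this (an illustrative sketch, not from the thread — the catalog name `demo` and the file paths are placeholders, and the SQLite JDBC driver must be on the classpath):

```properties
# Spark SQL catalog wiring for an Iceberg JDBC catalog (illustrative)
spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog
# A local SQLite file gives quickstart-level convenience without Hadoop
spark.sql.catalog.demo.uri=jdbc:sqlite:file:/tmp/iceberg_catalog.db
spark.sql.catalog.demo.warehouse=/tmp/warehouse
```

Swapping the `uri` for a PostgreSQL/MySQL JDBC URL is what makes the same configuration production-ready, which is the "aligns better with production best practices" point above.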


Re: [Discuss] Iceberg community maintaining the docker images

2024-10-10 Thread Jean-Baptiste Onofré
Hi

I think there's context missing here.

I agree with Ryan that Iceberg should not provide any docker image or
runtime things (we had the same discussion about REST server).

However, my understanding is that this discussion is also related to the
REST TCK. The TCK validation run needs a runtime, and I remember a
discussion we had with Daniel (running TCK as a docker container).

Regards
JB

On Wed, Oct 9, 2024 at 2:20 PM rdb...@gmail.com  wrote:

> I think it's important for a project to remain focused on its core
> purpose, and I've always advocated for Iceberg to remain a library that is
> easy to plug into other projects. I think that should be the guide here as
> well. Aren't projects like Spark and Trino responsible for producing easy
> to use Docker images of those environments? Why would the Iceberg project
> build and maintain them?
>
> I would prefer not to be distracted by these things, unless we need them
> for cases like supporting testing and validation of things that are part of
> the core purpose of the project.
>
> On Tue, Oct 8, 2024 at 6:08 AM Ajantha Bhat  wrote:
>
>> Hello everyone,
>>
Now that the test fixtures are in [1], we can create a runtime JAR for the
>> REST catalog adapter [2] from the TCK.
>> Following that, we can build and maintain the Docker image based on it
>> [3].
>>
>> I also envision the Iceberg community maintaining some quick-start Docker
>> images, such as spark-iceberg-rest, Trino-iceberg-rest, among others.
>>
>> I've looked into other Apache projects, and it seems that Apache Infra
>> can assist us with this process,
>> as we have the option to publish Iceberg docker images under the Apache
>> Docker Hub account.
>>
>> [image: image.png]
>>
>> I am more than willing to maintain this code, please find the PRs related
>> to the same [2] & [3].
>>
>> Any suggestions on the same? Contributions are welcome if we agree to
>> maintain it.
>>
>> [1] https://github.com/apache/iceberg/pull/10908
>> [2] https://github.com/apache/iceberg/pull/11279
>> [3] https://github.com/apache/iceberg/pull/11283
>>
>> - Ajantha
>>
>


Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Jean-Baptiste Onofré
+1

I did a review on the proposal and it looks good to me.

Regards
JB

On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
 wrote:
>
> Hi Y'all!
>
> I think we are more or less in agreement on adding Row Lineage to the spec 
> apart from a few details which may change a bit during implementation. 
> Because of this, I'd like to call for an overall vote on whether or not 
> Row-Lineage as described in  PR 11130 can be added to the spec.
>
> I'll note this is basically giving a thumbs up for reviewers and implementers 
> to go ahead with the pull-request and acknowledging that you support the 
> direction this proposal is going. I do think we'll probably dig a few things 
> out when we write the reference implementation, but I think in general we 
> have defined the required behaviors we want to see.
>
> Please vote in the next 72 hours
>
> [ ] +1, commit the proposed spec changes
> [ ] -0
> [ ] -1, do not make these changes because . . .
>
>
> Thanks everyone,
>
> Russ


Re: [DISCUSS] REST: Standardize vended credentials in Spec

2024-10-10 Thread Eduard Tudenhöfner
Based on recent discussions, the feedback was that we don't want to have
anything storage-specific in the OpenAPI spec (other than documenting the
different storage configurations, which is handled by #10576).
Therefore I've updated the PR and made it flexible enough so that we could
still pass different credentials based on a given *prefix* but not need a
new Spec change whenever new credentials are added/changed for a
storage provider.
That should hopefully work for everyone, so please take a look so that we
can do a formal VOTE on these changes.

It was also brought up that it would be helpful to see how this looks in
the context of refreshing vended credentials. I'll share the proposal for
refreshing vended credentials a bit later today.

Thanks
Eduard

On Tue, Sep 24, 2024 at 6:34 PM Dmitri Bourlatchkov
 wrote:

> > wrt ISO 8601 timestamps: I'd like to keep things consistent with other
> places in the spec, which are typically defined as millisecond values.
>
> Fair enough. Now that the spec states the reference point in time, using
> millisecond offsets is fine.
>
> Cheers,
> Dmitri.
>
> On Tue, Sep 24, 2024 at 10:41 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> Thanks Dmitri for reviewing the PR and the doc.
>>
>> wrt ISO 8601 timestamps: I'd like to keep things consistent with other
>> places in the spec, which are typically defined as millisecond values.
>>
>> Thanks
>> Eduard
>>
>> On Fri, Sep 20, 2024 at 4:46 PM Dmitri Bourlatchkov
>>  wrote:
>>
>>> Thanks for proposing this improvement, Eduard!
>>>
>>> Overall it seems pretty reasonable to me. I added a few comments in GH
>>> and in the doc.
>>>
>>> One higher level point I'd like to discuss is using ISO 8601 to format
>>> expiry timestamps (as opposed to numeric milliseconds values). This should
>>> hopefully make the config more human-readable without adding too much
>>> processing burden. I hope the standard is well supported by most language
>>> libraries now. It is certainly supported by java. WDYT?
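(Editorial aside, not part of the thread: the conversion Dmitri mentions is indeed a one-liner in most languages; e.g. in Python, an epoch-millisecond offset renders to ISO 8601 like this.)

```python
from datetime import datetime, timezone

def millis_to_iso(ms: int) -> str:
    """Render an epoch-millisecond offset as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat(
        timespec="milliseconds"
    )

print(millis_to_iso(0))  # 1970-01-01T00:00:00.000+00:00
```

Either representation is losslessly convertible to the other, so the debate is purely about readability versus consistency with the rest of the spec.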
>>>
>>> Thanks,
>>> Dmitri.
>>>
>>> On Fri, Sep 13, 2024 at 12:13 PM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
 Hey everyone,

 I'd like to propose standardizing the vended credentials used in
 loadTable / loadView responses.

 I opened #8 to track the proposal in GH.
 Please find the proposal doc here (estimated read time: 5 minutes).
 The proposal requires a spec change, which can be seen in #10722.

 Thanks,
 Eduard

>>>


Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

2024-10-10 Thread Eduard Tudenhöfner
I would prefer to advocate for the REST catalog in those examples/docs
(similar to how the Spark quickstart example
 uses the REST catalog). The
docs could then refer to the quickstart example to indicate what's required
in terms of services to be started before a user can spawn a spark shell.

On Thu, Oct 10, 2024 at 12:15 PM Jean-Baptiste Onofré 
wrote:

> Hi
>
> As we are talking about "documentation" (quick start/readme), I would
> rather propose to use the REST catalog here instead of JDBC.
>
> As it's the catalog we "promote", I think it would be valuable for
> users to start with the "right thing".
>
> The JDBC Catalog is interesting for a quick test/getting-started guide, but we know
> how it goes: it will be heavily used (see what happened with the
> HadoopCatalog, which was used in production even though it should not be :) ).
>
> Regards
> JB
>
> On Tue, Oct 8, 2024 at 12:18 PM Kevin Liu  wrote:
> >
> > Hi all,
> >
> > I wanted to bring up a suggestion regarding our current documentation.
> The existing examples for Iceberg often use the Hadoop catalog, as seen in:
> >
> > Adding a Catalog - Spark Quickstart [1]
> > Adding Catalogs - Spark Getting Started [2]
> >
> > Since we generally advise against using Hadoop catalogs in production
> environments, I believe it would be beneficial to replace these examples
> with ones that use the JDBC catalog. The JDBC catalog, configured with a
> local SQLite database file, offers similar convenience but aligns better
> with production best practices.
> >
> > I've created an issue [3] and a PR [4] to address this. Please take a
> look, and I'd love to hear your thoughts on whether this is a direction we
> want to pursue.
> >
> > Best,
> > Kevin Liu
> >
> > [1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
> > [2]
> https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
> > [3] https://github.com/apache/iceberg/issues/11284
> > [4] https://github.com/apache/iceberg/pull/11285
> >
>


[DISCUSS] REST: Refreshing vended credentials

2024-10-10 Thread Eduard Tudenhöfner
Hey everyone,

I'd like to propose a mechanism and changes in order to be able to refresh
vended credentials for tables.

Please find the proposal doc here.
The proposal requires a spec change, which can be seen in #11281.

As discussed in the last sync, this should hopefully help in better
understanding the proposal around standardizing credentials in the OpenAPI
spec, which is being discussed in
https://lists.apache.org/thread/jmklpnywnghg7qwmwr14zj2k6tnxmdo4.

Thanks,
Eduard


Re: [Discuss] Iceberg community maintaining the docker images

2024-10-10 Thread Ajantha Bhat
Yes, the PRs I mentioned are about running TCK as a docker container and
keeping/maintaining that docker file in the Iceberg repo.

I envisioned maintaining other docker images also because I am not sure
about the roadmap of the ones in our quickstart (example:
tabulario/spark-iceberg).

Thanks,
Ajantha

On Thu, Oct 10, 2024 at 3:50 PM Jean-Baptiste Onofré 
wrote:

> Hi
>
> I think there's context missing here.
>
> I agree with Ryan that Iceberg should not provide any docker image or
> runtime things (we had the same discussion about REST server).
>
> However, my understanding is that this discussion is also related to the
> REST TCK. The TCK validation run needs a runtime, and I remember a
> discussion we had with Daniel (running TCK as a docker container).
>
> Regards
> JB
>
> On Wed, Oct 9, 2024 at 2:20 PM rdb...@gmail.com  wrote:
>
>> I think it's important for a project to remain focused on its core
>> purpose, and I've always advocated for Iceberg to remain a library that is
>> easy to plug into other projects. I think that should be the guide here as
>> well. Aren't projects like Spark and Trino responsible for producing easy
>> to use Docker images of those environments? Why would the Iceberg project
>> build and maintain them?
>>
>> I would prefer not to be distracted by these things, unless we need them
>> for cases like supporting testing and validation of things that are part of
>> the core purpose of the project.
>>
>> On Tue, Oct 8, 2024 at 6:08 AM Ajantha Bhat 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> Now that the test fixtures are in [1], we can create a runtime JAR for
>>> the REST catalog adapter [2] from the TCK.
>>> Following that, we can build and maintain the Docker image based on it
>>> [3].
>>>
>>> I also envision the Iceberg community maintaining some quick-start
>>> Docker images, such as spark-iceberg-rest, Trino-iceberg-rest, among others.
>>>
>>> I've looked into other Apache projects, and it seems that Apache Infra
>>> can assist us with this process,
>>> as we have the option to publish Iceberg docker images under the Apache
>>> Docker Hub account.
>>>
>>> [image: image.png]
>>>
>>> I am more than willing to maintain this code, please find the PRs
>>> related to the same [2] & [3].
>>>
>>> Any suggestions on the same? Contributions are welcome if we agree to
>>> maintain it.
>>>
>>> [1] https://github.com/apache/iceberg/pull/10908
>>> [2] https://github.com/apache/iceberg/pull/11279
>>> [3] https://github.com/apache/iceberg/pull/11283
>>>
>>> - Ajantha
>>>
>>


Re: [Discuss] Iceberg community maintaining the docker images

2024-10-10 Thread Jean-Baptiste Onofré
It's actually what I meant by REST Catalog docker image for test.

Personally, I would not include any docker images in the Iceberg project
(but more in the "iceberg" ecosystem, which is different from the project
:)).

However, if the community has a different view on that, no problem.

Regards
JB

On Thu, Oct 10, 2024 at 9:50 AM Daniel Weeks  wrote:

> I think we should focus on the docker image for the test REST Catalog
> implementation.  This is somewhat different from the TCK since it's used by
> the python/rust/go projects for testing the client side of the REST
> specification.
>
> As for the quickstart/example type images, I'm open to discussing what
> makes sense here, but we should decouple that and other docker images from
> getting a test REST catalog image out.  (Seems like there's general
> consensus around that).
>
> -Dan
>
> On Thu, Oct 10, 2024 at 4:29 AM Ajantha Bhat 
> wrote:
>
>> Yes, the PRs I mentioned are about running TCK as a docker container and
>> keeping/maintaining that docker file in the Iceberg repo.
>>
>> I envisioned maintaining other docker images also because I am not sure
>> about the roadmap of the ones in our quickstart (example:
>> tabulario/spark-iceberg).
>>
>> Thanks,
>> Ajantha
>>
>> On Thu, Oct 10, 2024 at 3:50 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi
>>>
>>> I think there's context missing here.
>>>
>>> I agree with Ryan that Iceberg should not provide any docker image or
>>> runtime things (we had the same discussion about REST server).
>>>
>>> However, my understanding is that this discussion is also related to the
>>> REST TCK. The TCK validation run needs a runtime, and I remember a
>>> discussion we had with Daniel (running TCK as a docker container).
>>>
>>> Regards
>>> JB
>>>
>>> On Wed, Oct 9, 2024 at 2:20 PM rdb...@gmail.com 
>>> wrote:
>>>
 I think it's important for a project to remain focused on its core
 purpose, and I've always advocated for Iceberg to remain a library that is
 easy to plug into other projects. I think that should be the guide here as
 well. Aren't projects like Spark and Trino responsible for producing easy
 to use Docker images of those environments? Why would the Iceberg project
 build and maintain them?

 I would prefer not to be distracted by these things, unless we need
 them for cases like supporting testing and validation of things that are
 part of the core purpose of the project.

 On Tue, Oct 8, 2024 at 6:08 AM Ajantha Bhat 
 wrote:

> Hello everyone,
>
> Now that the test fixtures are in [1], we can create a runtime JAR for
> the REST catalog adapter [2] from the TCK.
> Following that, we can build and maintain the Docker image based on it
> [3].
>
> I also envision the Iceberg community maintaining some quick-start
> Docker images, such as spark-iceberg-rest, Trino-iceberg-rest, among 
> others.
>
> I've looked into other Apache projects, and it seems that Apache Infra
> can assist us with this process,
> as we have the option to publish Iceberg docker images under the
> Apache Docker Hub account.
>
> [image: image.png]
>
> I am more than willing to maintain this code, please find the PRs
> related to the same [2] & [3].
>
> Any suggestions on the same? Contributions are welcome if we agree to
> maintain it.
>
> [1] https://github.com/apache/iceberg/pull/10908
> [2] https://github.com/apache/iceberg/pull/11279
> [3] https://github.com/apache/iceberg/pull/11283
>
> - Ajantha
>



Re: [Discuss] Apache Iceberg 1.6.2 release because of Avro CVE ?

2024-10-10 Thread Manu Zhang
Hi Ajantha,

There is a bug[1] in migration procedures (e.g. add_files) when the option
`parallelism` is larger than 1.
I've submitted a fix[2] against the main branch and would like to back-port
to 1.6.x.

[1] https://github.com/apache/iceberg/issues/11147
[2] https://github.com/apache/iceberg/pull/11157

Thanks,
Manu


On Thu, Oct 10, 2024 at 10:48 AM Ajantha Bhat  wrote:

> Hi everyone,
> Since 1.7.0 is still a few weeks away,
> how about releasing version 1.6.2 with just the Avro version update?
> The current Avro version in 1.6.1 (1.11.3) has a recently reported CVE:
> CVE-2024-47561 [2].
>
> I'm happy to coordinate and be the release manager for this.
> [1]
> https://github.com/apache/iceberg/blob/8e9d59d299be42b0bca9461457cd1e95dbaad086/gradle/libs.versions.toml#L28
> [2] https://lists.apache.org/thread/c2v7mhqnmq0jmbwxqq3r5jbj1xg43h5x
>
> - Ajantha
>


Re: Iceberg View Spec Improvements

2024-10-10 Thread Daniel Weeks
Walaa,

I just want to expand upon what Ryan said a little.  The catalog naming
issue was identified when we designed the view spec and we opted for
simplicity as opposed to trying to solve for catalog name mapping as it
really complicates the spec/implementation.  There may be ways for
implementations to address this where there are inconsistencies (like
Russel's federation suggestion or even some level of catalog or engine
redirection like Trino supports), however I don't think this needs to be
called out explicitly in the view spec.  The spec is clear about catalog
names and default namespaces, so while inconsistent names are important in
practice, the spec is clear.

I would also point out that there are other portability problems like
multi-level namespaces, which are supported both in the default namespace
and via SQL syntax but which many engines cannot interpret.  There's a lot
that is not (or cannot be) fully addressed in the view spec, but for
practical purposes it is still very capable and portable.

-Dan



On Wed, Oct 9, 2024 at 6:28 PM Walaa Eldin Moustafa 
wrote:

> Thanks Ryan and everyone who left feedback on the doc. Let me clarify a
> few things.
>
> "Improving the spec" also includes making the implicit assumptions
> explicitly stated in the spec.
>
> Explicitly stating the assumptions is discussed under the "Portable table
> identifiers" section in the doc. I am onboard with that direction.
>
> I think this section encodes the suggestions shared by Steven and Russell
> as well as the suggestion shared by you, and a couple more actually to
> ensure it is comprehensive/unambiguous. I will reiterate the assumptions
> below. If folks think we could go with those assumptions, I can create a PR
> to reflect them on the spec.
>
> * Engines must share the same default catalog names, ensuring that
> partially specified SQL identifiers with catalog omitted are resolved to
> the same fully specified SQL identifier across all engines.
> * Engines must share the same default namespaces, ensuring that SQL
> identifiers without catalog and namespace are resolved to the same fully
> specified SQL identifier across all engines.
> * All engines must resolve a fully specified SQL identifier to the same
> storage table in the same storage catalog.
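(Editorial aside: the three assumptions above can be pictured with a small resolution sketch. This is illustrative only — real engines handle identifiers with far more nuance, and all names below are made up.)

```python
def resolve(identifier: str, default_catalog: str, default_namespace: str):
    """Resolve a possibly partial SQL identifier to (catalog, namespace, name).

    Omitted catalog/namespace parts fall back to the engine defaults; per the
    assumptions above, engines sharing the same defaults resolve any partial
    identifier to the same fully specified one.
    """
    parts = identifier.split(".")
    if len(parts) == 1:   # bare table name: both defaults apply
        return (default_catalog, default_namespace, parts[0])
    if len(parts) == 2:   # namespace.table: catalog omitted
        return (default_catalog, parts[0], parts[1])
    # fully specified: catalog.namespace(...).table
    return (parts[0], ".".join(parts[1:-1]), parts[-1])

# Two engines configured with the same defaults agree on a partial identifier:
print(resolve("orders", "prod", "sales"))  # ('prod', 'sales', 'orders')
```

The third assumption then says that the fully specified result — here `('prod', 'sales', 'orders')` — must map to the same storage table in every engine.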
>
> Thanks,
> Walaa.
>
>


Re: [PROPOSAL] Partially Loading Metadata - LoadTable V2

2024-10-10 Thread Daniel Weeks
Hey Haizhou,

I think you've done a great job of capturing some of the metadata size
related issues in the doc, but I would echo Eduard's comments that we
should explore using the existing refs only loading first.  This may
require adding similar functionality for schemas/logs if we think that is a
major issue (we have run into cases where that is an issue, but there's
also maintenance work going on to help address some of these issues).

The current refs only approach does fall back to a full metadata load when
committing, but that was largely due to the complexity of changing the
TableMetadata implementation, not necessarily a limitation of the REST spec.

Definitely something we should be exploring, but we might already have some
approaches that we can build upon.

-Dan

On Thu, Oct 10, 2024 at 6:37 AM Eduard Tudenhöfner 
wrote:

> Hey Haizhou,
>
> thanks for working on that proposal. I think my main concern with the
> current proposal is that it adds quite a lot of complexity at a bunch of
> places, since you'd need to partially update *TableMetadata*.
> Additionally, it requires a new endpoint.
>
> An alternative to that would be to do something similar to what we already
> have in *TableMetadata*, where we lazily load *snapshots* when needed. We
> could expand that approach to lazily load the full *TableMetadata* from
> the server when necessary and always only show a slim version of
> *TableMetadata*. I did such a POC a while ago, which can be seen in
> https://github.com/nastra/iceberg/commit/ae2c7768c6f37be2f86b575bfc4fe84429b22a0e.
> That POC would need to be expanded so that it doesn't only do this for
> snapshots, but also for other fields.
> I believe the main fields that can get quite large over time are *snapshots
> / metadata-log / snapshot-log / schemas*.
>
> Might be worth checking how much we could gain by using a lazy table
> metadata supplier in this scenario, as that would reduce the required
> complexity.
>
> Thanks,
> Eduard
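Eduard's lazy-loading suggestion could be sketched roughly as below. The class and method names are invented for illustration and are not Iceberg's actual `TableMetadata` API; the point is only that large fields (snapshots, logs, schemas) are fetched on first access rather than in the initial load.

```python
# Minimal sketch of a lazily loaded metadata view: the slim load carries only
# small fields, and a supplier fetches the large ones on demand, at most once.
from typing import Callable, Optional

class LazyTableMetadata:
    def __init__(self, location: str, fetch_snapshots: Callable[[], list]):
        self.location = location
        self._fetch_snapshots = fetch_snapshots   # e.g. a second catalog request
        self._snapshots: Optional[list] = None

    @property
    def snapshots(self) -> list:
        if self._snapshots is None:               # fetched at most once, then cached
            self._snapshots = self._fetch_snapshots()
        return self._snapshots

calls = []
def fake_fetch():
    calls.append(1)
    return [{"snapshot-id": 1}, {"snapshot-id": 2}]

meta = LazyTableMetadata("s3://bucket/tbl", fake_fetch)
assert calls == []             # nothing fetched by the slim load
print(len(meta.snapshots))     # first access triggers the single fetch
print(len(meta.snapshots))     # served from cache, no second call
assert calls == [1]
```

This captures why the approach reduces complexity relative to a new endpoint: the server surface stays the same, and only the client decides when to materialize the heavy fields.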
>
>
>
> On Thu, Oct 10, 2024 at 2:05 AM Haizhou Zhao 
> wrote:
>
>> Hello Dev List,
>>
>>
>> I want to bring this proposal to discussion:
>>
>>
>>
>> https://docs.google.com/document/d/1eXnT0ZiFvdm_Zvk6fLGT_UxVWO-HsiqVywqu1Uk8s7E/edit#heading=h.uad1lm906wz4
>>
>>
>>
>> It proposes a new LoadTable API (branded LoadTableV2 at the moment) on
>> REST spec that allows partially loading table metadata. The motivation is
>> to stabilize and optimize Spark write workloads, especially on Iceberg
>> tables with big metadata (e.g. due to huge list of snapshot/metadata log,
>> complicated schema, etc.). We want to leverage this proposal to reduce
>> operational and monetary cost of Iceberg & REST catalog usages, and achieve
>> higher commit frequencies (DDL & DML included) on top of Iceberg tables
>> through REST catalog.
>>
>>
>>
>> Looking forward to hearing feedback and discussions.
>>
>>
>> Thank you,
>>
>> Haizhou
>>
>


Re: [Discuss] Iceberg community maintaining the docker images

2024-10-10 Thread Daniel Weeks
I think we should focus on the docker image for the test REST Catalog
implementation.  This is somewhat different from the TCK since it's used by
the python/rust/go projects for testing the client side of the REST
specification.

As for the quickstart/example type images, I'm open to discussing what
makes sense here, but we should decouple that and other docker images from
getting a test REST catalog image out.  (Seems like there's general
consensus around that).

-Dan

On Thu, Oct 10, 2024 at 4:29 AM Ajantha Bhat  wrote:

> Yes, the PRs I mentioned are about running TCK as a docker container and
> keeping/maintaining that docker file in the Iceberg repo.
>
> I envisioned maintaining other docker images also because I am not sure
> about the roadmap of the ones in our quickstart
>  (example:
> tabulario/spark-iceberg).
>
> Thanks,
> Ajantha
>
> On Thu, Oct 10, 2024 at 3:50 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi
>>
>> I think there's context missing here.
>>
>> I agree with Ryan that Iceberg should not provide any docker image or
>> runtime things (we had the same discussion about REST server).
>>
>> However, my understanding is that this discussion is also related to the
>> REST TCK. The TCK validation run needs a runtime, and I remember a
>> discussion we had with Daniel (running TCK as a docker container).
>>
>> Regards
>> JB
>>
>> On Wed, Oct 9, 2024 at 2:20 PM rdb...@gmail.com  wrote:
>>
>>> I think it's important for a project to remain focused on its core
>>> purpose, and I've always advocated for Iceberg to remain a library that is
>>> easy to plug into other projects. I think that should be the guide here as
>>> well. Aren't projects like Spark and Trino responsible for producing easy
>>> to use Docker images of those environments? Why would the Iceberg project
>>> build and maintain them?
>>>
>>> I would prefer not to be distracted by these things, unless we need them
>>> for cases like supporting testing and validation of things that are part of
>>> the core purpose of the project.
>>>
>>> On Tue, Oct 8, 2024 at 6:08 AM Ajantha Bhat 
>>> wrote:
>>>
 Hello everyone,

 Now that the test fixtures are in [1], we can create a runtime JAR for

 the REST catalog adapter [2] from the TCK.
 Following that, we can build and maintain the Docker image based on it
 [3].

 I also envision the Iceberg community maintaining some quick-start
 Docker images, such as spark-iceberg-rest, Trino-iceberg-rest, among 
 others.

 I've looked into other Apache projects, and it seems that Apache Infra
 can assist us with this process, since we have the option to publish
 Iceberg docker images under the Apache Docker Hub account.


 I am more than willing to maintain this code, please find the PRs
 related to the same [2] & [3].

 Any suggestions on the same? Contributions are welcome if we agree to
 maintain it.

 [1] https://github.com/apache/iceberg/pull/10908
 [2] https://github.com/apache/iceberg/pull/11279
 [3] https://github.com/apache/iceberg/pull/11283

 - Ajantha

>>>


Re: [DISCUSS] REST: Refreshing vended credentials

2024-10-10 Thread Jack Ye
+1 for adding this in the REST spec.

Glue has a similar API GetTemporaryGlueTableCredentials [1], which was
introduced because of performance and also security reasons. For example,
we don't want to propagate credentials across the compute nodes in the
cluster; instead, each compute node fetches only the credentials it needs,
independently. Such an API also comes in handy for improvements like caching.
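The per-node caching mentioned above could look roughly like this. It is an illustrative sketch with made-up names, not a real Glue or REST client: `fetch` stands in for a call such as GetTemporaryGlueTableCredentials, and the `expires_at` field is an assumed shape for the response.

```python
# Sketch: cache vended table credentials per compute node and re-fetch
# shortly before they expire, so nodes never share propagated credentials.
import time

class CredentialCache:
    def __init__(self, fetch, refresh_margin_s: float = 60.0):
        self._fetch = fetch            # hypothetical credential-vending call
        self._margin = refresh_margin_s
        self._cache = {}               # table -> (creds, expires_at)

    def get(self, table: str) -> dict:
        entry = self._cache.get(table)
        if entry is None or entry[1] - time.time() < self._margin:
            creds = self._fetch(table)  # each node fetches independently
            self._cache[table] = (creds, creds["expires_at"])
            return creds
        return entry[0]

fetches = []
def fake_fetch(table):
    fetches.append(table)
    return {"access_key": "AKIA...", "expires_at": time.time() + 3600}

cache = CredentialCache(fake_fetch)
cache.get("db.t1")
cache.get("db.t1")                     # second call is served from the cache
assert fetches == ["db.t1"]
```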

Best,
Jack Ye

[1]
https://docs.aws.amazon.com/cli/latest/reference/lakeformation/get-temporary-glue-table-credentials.html


On Thu, Oct 10, 2024 at 3:47 AM Eduard Tudenhöfner 
wrote:

> Hey everyone,
>
> I'd like to propose a mechanism and changes in order to be able to refresh
> vended credentials for tables.
>
> Please find the proposal doc here
> 
> .
> The proposal requires a spec change, which can be seen in #11281
> .
>
> As discussed in the last sync, this should hopefully help in better
> understanding the proposal around standardizing credentials in the OpenAPI
> spec, which is being discussed in
> https://lists.apache.org/thread/jmklpnywnghg7qwmwr14zj2k6tnxmdo4.
>
> Thanks,
> Eduard
>


Re: [Discuss] Iceberg community maintaining the docker images

2024-10-10 Thread rdb...@gmail.com
I was specifically replying to this suggestion to add docker images for
Trino and Spark:

> I also envision the Iceberg community maintaining some quick-start Docker
images, such as spark-iceberg-rest, Trino-iceberg-rest, among others.

It sounds like we're mostly agreed that the Iceberg project itself isn't a
good place to do that. As for an image that is for catalog implementations
to test against, I think that's a good idea (supporting testing and
validation).

On Thu, Oct 10, 2024 at 10:56 AM Jean-Baptiste Onofré 
wrote:

> It's actually what I meant by REST Catalog docker image for test.
>
> Personally, I would not include any docker images in the Iceberg project
> (but more in the "iceberg" ecosystem, which is different from the project
> :)).
>
> However, if the community has a different view on that, no problem.
>
> Regards
> JB
>
> On Thu, Oct 10, 2024 at 9:50 AM Daniel Weeks  wrote:
>
>> I think we should focus on the docker image for the test REST Catalog
>> implementation.  This is somewhat different from the TCK since it's used by
>> the python/rust/go projects for testing the client side of the REST
>> specification.
>>
>> As for the quickstart/example type images, I'm open to discussing what
>> makes sense here, but we should decouple that and other docker images from
>> getting a test REST catalog image out.  (Seems like there's general
>> consensus around that).
>>
>> -Dan
>>
>> On Thu, Oct 10, 2024 at 4:29 AM Ajantha Bhat 
>> wrote:
>>
>>> Yes, the PRs I mentioned are about running TCK as a docker container and
>>> keeping/maintaining that docker file in the Iceberg repo.
>>>
>>> I envisioned maintaining other docker images also because I am not sure
>>> about the roadmap of the ones in our quickstart
>>>  (example:
>>> tabulario/spark-iceberg).
>>>
>>> Thanks,
>>> Ajantha
>>>
>>> On Thu, Oct 10, 2024 at 3:50 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Hi

 I think there's context missing here.

 I agree with Ryan that Iceberg should not provide any docker image or
 runtime things (we had the same discussion about REST server).

 However, my understanding is that this discussion is also related to
 the REST TCK. The TCK validation run needs a runtime, and I remember a
 discussion we had with Daniel (running TCK as a docker container).

 Regards
 JB

 On Wed, Oct 9, 2024 at 2:20 PM rdb...@gmail.com 
 wrote:

> I think it's important for a project to remain focused on its core
> purpose, and I've always advocated for Iceberg to remain a library that is
> easy to plug into other projects. I think that should be the guide here as
> well. Aren't projects like Spark and Trino responsible for producing easy
> to use Docker images of those environments? Why would the Iceberg project
> build and maintain them?
>
> I would prefer not to be distracted by these things, unless we need
> them for cases like supporting testing and validation of things that are
> part of the core purpose of the project.
>
> On Tue, Oct 8, 2024 at 6:08 AM Ajantha Bhat 
> wrote:
>
>> Hello everyone,
>>
>> Now that the test fixtures are in [1], we can create a runtime JAR for
>> the REST catalog adapter [2] from the TCK.
>> Following that, we can build and maintain the Docker image based on
>> it [3].
>>
>> I also envision the Iceberg community maintaining some quick-start
>> Docker images, such as spark-iceberg-rest, Trino-iceberg-rest, among 
>> others.
>>
>> I've looked into other Apache projects, and it seems that Apache
>> Infra can assist us with this process, since we have the option to
>> publish Iceberg docker images under the Apache Docker Hub account.
>>
>>
>> I am more than willing to maintain this code, please find the PRs
>> related to the same [2] & [3].
>>
>> Any suggestions on the same? Contributions are welcome if we agree to
>> maintain it.
>>
>> [1] https://github.com/apache/iceberg/pull/10908
>> [2] https://github.com/apache/iceberg/pull/11279
>> [3] https://github.com/apache/iceberg/pull/11283
>>
>> - Ajantha
>>
>


Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Daniel Weeks
+1

Thanks Russell!

On Thu, Oct 10, 2024 at 6:57 AM Eduard Tudenhöfner 
wrote:

> I left a few comments on the proposal but I'm overall +1 on the proposal
>
> On Thu, Oct 10, 2024 at 12:08 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> I did a review on the proposal and it looks good to me.
>>
>> Regards
>> JB
>>
>> On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
>>  wrote:
>> >
>> > Hi Y'all!
>> >
>> > I think we are more or less in agreement on adding Row Lineage to the
>> spec apart from a few details which may change a bit during implementation.
>> Because of this, I'd like to call for an overall vote on whether or not
>> Row-Lineage as described in PR 11130 can be added to the spec.
>> >
>> > I'll note this is basically giving a thumbs up for reviewers and
>> implementers to go ahead with the pull-request and acknowledging that you
>> support the direction this proposal is going. I do think we'll probably dig
>> a few things out when we write the reference implementation, but I think in
>> general we have defined the required behaviors we want to see.
>> >
>> > Please vote in the next 72 hours
>> >
>> > [ ] +1, commit the proposed spec changes
>> > [ ] -0
>> > [ ] -1, do not make these changes because . . .
>> >
>> >
>> > Thanks everyone,
>> >
>> > Russ
>>
>


Re: [DISCUSS] Defining a concept of "externally owned" tables in the REST spec

2024-10-10 Thread Jack Ye
Thanks Dennis for raising this! I had a similar discussion last year [1]
that I definitely want to discuss more. But I feel the main focus of this
discussion is less about external tables, but more about federation vs
notification. For this topic, I have 2 questions:

(1) To support federation, what is still missing in the REST spec?

It feels to me that we don't need anything new. As long as the REST server
supports hooking up to another REST server and adds the proxy routing
logic, it would work. For example, think about a Polaris catalog federating
a Unity catalog, Unity catalog will show as a "unity" namespace in Polaris,
and a table ns1.table1 will be shown as unity.ns1.table1 through Polaris. A
LoadTable(unity.ns1, table1) against this table will hit Polaris, and
Polaris as a proxy calls Unity with LoadTable(ns1, table1). And you can
imagine that you can do this proxy routing for every single API.
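The proxy routing described above can be sketched as follows. The names are hypothetical, and real catalogs would route HTTP requests rather than function calls, but the prefix-stripping logic is the same: the fronting catalog peels off the mount prefix ("unity") and forwards the remainder to the federated catalog.

```python
# Sketch of namespace-prefix federation: a fronting catalog exposes a
# federated catalog under a mount prefix and proxies LoadTable calls to it.
def make_load_table(backends):
    """backends maps a mount prefix to that catalog's own load_table function."""
    def load_table(namespace: tuple, table: str):
        prefix, *rest = namespace
        if prefix in backends:                      # federated: strip prefix, proxy
            return backends[prefix](tuple(rest), table)
        return {"catalog": "polaris", "ns": namespace, "table": table}
    return load_table

def unity_load_table(namespace, table):             # the federated catalog
    return {"catalog": "unity", "ns": namespace, "table": table}

load_table = make_load_table({"unity": unity_load_table})
print(load_table(("unity", "ns1"), "table1"))
# {'catalog': 'unity', 'ns': ('ns1',), 'table': 'table1'}
print(load_table(("local",), "t"))
# {'catalog': 'polaris', 'ns': ('local',), 'table': 't'}
```

The same wrapping applies to every other API, which is why no new REST spec surface is strictly needed for this style of federation.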

One nit is that maybe mapping a catalog to a namespace level feels hacky,
and we could add another catalog level in the REST spec with CRUD APIs for
catalogs. That concept is already there in Polaris, maybe worth adding to
Iceberg REST? But that is not strictly required for federation to work
since we have the namespace approach.

(2) if we all do federation, do we still need the notification approach?

If the final goal is just to enable access of one catalog in another
catalog, federation already serves the goal. There could be caching built
at the proxy layer, and cache eviction could be improved with
notifications. But that feels like an optimization. What is the value for
having both approaches at the same time?

Best,
Jack Ye

[1] https://lists.apache.org/thread/ohqfvhf4wofzkhrvff1lxl58blh432o6

On Wed, Oct 9, 2024 at 5:51 PM Dennis Huo  wrote:

> Summarizing discussion from today's Iceberg Catalog Community Sync, here
> were some of the key points:
>
>- General agreement on the need for some flavors of mechanisms for
>catalog federation in-line with this proposal
>- We should come up with a more fitting name for the endpoint other
>than "notifications"
>   - Some debate over whether to just add behaviors to updateTable or
>   registerTable endpoints; ultimately agreed that the behavior of these
>   tables is intended to be fundamentally different, and want to avoid
>   accidentally dangerous implementations, so it's better to have a 
> different
>   endpoint
>   - The idea of "Notifications" in itself is too general for this
>   purpose, and we might want something in the future that is more in-line
>   with much more generalized Notifications and don't want a conflict
>   - This endpoint focuses on the semantic of "force-update" without a
>   standard Iceberg commit protocol
>- The endpoint should potentially be a "bulk endpoint" since the use
>case is more likely to want to reflect batches at a time
>   - Some debate over whether this is strictly necessary, and whether
>   there would be any implicit atomicity expectations
>   - For this use case the goal is explicitly *not* to perform a
>   heavyweight commit protocol, so a bulk API is just an optimization to 
> avoid
>   making a bunch of individual calls; some or all of the requests in the 
> bulk
>   request could succeed or fail
>- The receiving side should not have structured failure modes relating
>to out-of-sync state -- e.g. the caller should not be depending on response
>state to determine consistency on the sending side
>   - This was debated with pros/cons of sending meaningful response
>   errors
>   - Pro: Useful for the caller to receive some amount of feedback to
>   know whether the force-update made it through, whether there are other
>   issues preventing syncing, etc
>   - Con: This is likely a slippery-slope of scope creep that still
>   fundamentally only partially addresses failure modes; instead, the 
> overall
>   system must be designed for idempotency of declared updated state and if
>   consistency is desired, the caller must not rely only on responses to
>   reconcile state anyways
>- We want to separate out the discussion of the relative merits of a
>push vs pull model of federation, so the merits of pull/polling/readthrough
>don't preclude adding this push-based endpoint
>   - In-depth discussion of relative pros/cons, but agreed that one
>   doesn't necessarily preclude the other, and this push endpoint targets a
>   particular use case
>- Keep the notion of "external tables" only "implicit" instead of
>having to plumb a new table type everywhere (for now?)
>   - We could document the intended behavior of tables that come into
>   existence from this endpoint having a different "ownership" semantic 
> than
>   those created by createTable/registerTable, but it REST spec itself 
> doesn't
>   necessarily need to expose any specific 

Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Jack Ye
+1, overall agree that we should add this!

Best,
Jack Ye

On Thu, Oct 10, 2024 at 1:43 PM Daniel Weeks  wrote:

> +1
>
> Thanks Russell!
>
> On Thu, Oct 10, 2024 at 6:57 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> I left a few comments on the proposal but I'm overall +1 on the proposal
>>
>> On Thu, Oct 10, 2024 at 12:08 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1
>>>
>>> I did a review on the proposal and it looks good to me.
>>>
>>> Regards
>>> JB
>>>
>>> On Tue, Oct 8, 2024 at 3:55 PM Russell Spitzer
>>>  wrote:
>>> >
>>> > Hi Y'all!
>>> >
>>> > I think we are more or less in agreement on adding Row Lineage to the
>>> spec apart from a few details which may change a bit during implementation.
>>> Because of this, I'd like to call for an overall vote on whether or not
>>> Row-Lineage as described in  PR 11130 can be added to the spec.
>>> >
>>> > I'll note this is basically giving a thumbs up for reviewers and
>>> implementers to go ahead with the pull-request and acknowledging that you
>>> support the direction this proposal is going. I do think we'll probably dig
>>> a few things out when we write the reference implementation, but I think in
>>> general we have defined the required behaviors we want to see.
>>> >
>>> > Please vote in the next 72 hours
>>> >
>>> > [ ] +1, commit the proposed spec changes
>>> > [ ] -0
>>> > [ ] -1, do not make these changes because . . .
>>> >
>>> >
>>> > Thanks everyone,
>>> >
>>> > Russ
>>>
>>


Spec changes for deletion vectors

2024-10-10 Thread rdb...@gmail.com
Hi everyone,

There seems to be broad agreement around Anton's proposal to use deletion
vectors in Iceberg v3, so I've opened two PRs that update the spec with the
proposed changes. The first, PR #11238
, adds a new Puffin
blob type, delete-vector-v1, that stores a delete vector. The second, PR
#11240 , updates the
Iceberg table spec.

Please take a look and comment!

Ryan