Re: [DISCUSS] Donation of Dremio Auth Manager to the Apache Iceberg project

2025-07-29 Thread Ryan Blue
as we introduce >> >> more features, it will become impractical to keep it there, especially >> >> since some of the features will require third-party dependencies. As a >> >> data point: the new manager contains almost 100 Java production >> >> classes

Re: [VOTE] Update the table statistics (puffin stats) spec

2025-07-28 Thread Ryan Blue
+1 thanks for looking into this. On Mon, Jul 28, 2025 at 3:39 PM Amogh Jahagirdar <2am...@gmail.com> wrote: > +1 to fixing this to be a long > > On Mon, Jul 28, 2025 at 4:38 PM Kevin Liu wrote: > >> +1 Great catch. I also did a search for "snapshot-id" in >> https://iceberg.apache.org/spec/ ever

Re: Thoughts on Adding a `doc` Property for Schema Objects

2025-07-25 Thread Ryan Blue
that affect reading and writing and is not intended to be used for > arbitrary metadata.” Based on this, a comment seems to fall under > “arbitrary metadata,” and therefore may not be an appropriate use of > properties. > - Table comments seem to have become significant enough that relying

Re: Iceberg 1.10.0 release update - July 1, 2025

2025-07-25 Thread Ryan Blue
I thought that we said we wanted to get support out for v3 features in this release unless there is some reasonable blocker, like Spark not having geospatial types. To me, I think that means we should aim to get variant and unknown done so that we have a complete implementation with a major engine.

Re: Thoughts on Adding a `doc` Property for Schema Objects

2025-07-24 Thread Ryan Blue
Iceberg does allow you to store table descriptions. The convention is to use a table property, "comment". While this isn't a schema-level doc/comment, I don't know of anything that makes a distinction between schema description and table description, so I think it should work for your use. On Tue,

Re: [DISCUSS] v4 - Improved column statistics

2025-07-22 Thread Ryan Blue
gt;>>> Eduard >>>> >>>> On Tue, Jul 8, 2025 at 4:09 AM Jacky Lee wrote: >>>> >>>>> +1 for the wonderful feature. Please count me in if you need any help. >>>>> >>>>> Gábor Kaszab 于2025年7月7日周一 21:22写道: >>>>> > >>&

Re: [DISCUSS] Iceberg REST FGAC proposal

2025-07-22 Thread Ryan Blue
> folding specially with context resolution so that final projection list aka > transforms along with the filters apply can be conveyed back, > > > > It will be really nice to see this move forward! > > > > Best, > > Prashant Singh > > > > > > On Mo

Re: [DISCUSS] Iceberg REST FGAC proposal

2025-07-21 Thread Ryan Blue
I agree with Russell. The proposal doesn't look too controversial given previous discussions on how to support FGAC managed by the catalog. I also agree that a more detailed proposal should use Iceberg expressions and transforms for the row-level filters and column mask expressions, and catalog-man

Re: [Discuss] Proposal to support set(metadata) on TableOperations

2025-07-09 Thread Ryan Blue
ore as well :) I want a Catalog api for >> "register" directly then each implementation can decide how that gets >> applied. >> >> On Wed, Jul 9, 2025 at 4:13 PM Ryan Blue wrote: >> >>> Thanks for bringing this up, Hongyue. I think the logic here mak

Re: [Discuss] Proposal to support set(metadata) on TableOperations

2025-07-09 Thread Ryan Blue
Thanks for bringing this up, Hongyue. I think the logic here makes sense and that `commit(base, new)` probably isn't a good API to use for `registerTable`. But my main objection is that I don't think that it makes sense to use `TableOperations` for this. Adding a `set` method is awkward because the

Re: [DISCUSS] Replace table transaction in REST Catalog

2025-07-07 Thread Ryan Blue
an overloading >> REPLACE. >> >> It seems to me that the problem is that we never fixed the partition bug >> >> I think most database vendors or SQL users don't reason about REPLACE in >> the bespoke manner of Iceberg's current reference implementation. While >

Re: cleanExpiredMetadata in RemoveSnapshots

2025-07-07 Thread Ryan Blue
>> >>>>>> >>> >>>>>> Sure, keeping the default as false makes sense because this is a >>> new feature, so let's be on the safe side. >>> >>>>>> >>> >>>>>> About exposing setting the flag in the Spark action/pr

Re: [DISCUSS] Proposal for Iceberg 1.9.2 Release to Fix Critical REST Client Issue

2025-07-07 Thread Ryan Blue
.2 is a patch release, from my understanding, it should not cause > behaviour changes, as treating 503 the same as 502 / 504 causes behaviour > changes, IMHO ! > > Given that PR-13352 <https://github.com/apache/iceberg/pull/13352> got > in, I started a vote thread for the same.

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

2025-06-26 Thread Ryan Blue
t's truly necessary. > > [1] > https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown > Yufei > > > On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue wrote: > >> Yufei, could you make the argument for supporti

Re: [DISCUSS] Proposal for Iceberg 1.9.2 Release to Fix Critical REST Client Issue

2025-06-26 Thread Ryan Blue
>> retry with 502 and 504 we can conflict with ourselves, as we don't know >>>> when we receive the status of 429 is it because we retried on 502 and then >>>> got 429 or something else happened, so I thought it's better to throw the >>>> commit sta

Re: [DISCUSS] Replace table transaction in REST Catalog

2025-06-26 Thread Ryan Blue
I think that we should definitely fix the partition spec bug. This is one reason why we lazily bind specs in newer implementations, and there are other ways to get into this situation because the table state where an older partition spec references a column ID that no longer exists is perfectly val

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

2025-06-18 Thread Ryan Blue
I've updated the design document[1] based on the previous comments. >>>>>> Additionally, I've included the SQL UDF syntax supported by various >>>>>> vendors, including Dremio, Snowflake, Databricks, and Trino. >>>>>> >>>>>

Re: [DISCUSS] Proposal for Iceberg 1.9.2 Release to Fix Critical REST Client Issue

2025-06-18 Thread Ryan Blue
Are we confident that the REST client fix is the right path? If I understand correctly, the issue is that there is a load balancer that is converting a 500 response to a 503 that the client will automatically retry, even though the 500 response is not retry-able. Then the retry results in a 409 be

Re: [DISCUSS] Donation of Dremio Auth Manager to the Apache Iceberg project

2025-06-18 Thread Ryan Blue
I think it would be great to bring this functionality into Iceberg. I'm curious about your plan for getting it in. It sounds like you're suggesting adding the Dremio project to the Iceberg repo and making it optional. Why not contribute the functionality directly to the AuthManager already in Icebe

Re: [DISCUSS] Kafka Connect delta writer support

2025-06-10 Thread Ryan Blue
I'm strongly against writing equality deletes from the KC writer because it can't sort to make the deletes more efficient to apply. I don't think that equality deletes should be used in situations like this, and I think it is only going to cause pain for users that don't understand that they need offli

Re: [DISCUSS] v4 - Improved column statistics

2025-06-05 Thread Ryan Blue
> I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields. This isn't necessarily true. Avro can benefit from better pushdown with Eduard's approach as well by being able to skip more efficiently. With the current layout, Avro stores a list of key/va

Re: [DISCUSS] June board report

2025-06-05 Thread Ryan Blue
> include the PMC members, meaning that Iceberg has 13 committers (not PMC > member). The board report tool (old one) should be clearer about that. > > Regards > JB > > On Wed, Jun 4, 2025 at 16:20, Ryan Blue wrote: > >> Hi everyone, >> >> Here’s my d

[DISCUSS] June board report

2025-06-04 Thread Ryan Blue
Hi everyone, Here’s my draft of our board report for June. I went through the old syncs for highlights, but please reply if you want me to add any more! Ryan Description: Apache Iceberg is a table format for huge analytic datasets that is designed for high performance and ease of use. Project St

Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Ryan Blue
I think we can discard column stats for position deletes, as long as the data file path is preserved (as it is in #13161). For equality deletes, we need to preserve the stats for any equality ID columns. That reduces false positives by ensuring that the IDs being deleted might be in the data file t

Re: [DISCUSS] v4 - One file commits

2025-05-30 Thread Ryan Blue
al in this area and it would be great to >>>> collaborate with different folks and exchange ideas here, since I think a >>>> lot of people are interested in solving this problem. >>>> >>>> Thanks, >>>> Amogh Jahagirdar >>>> >>>>

[DISCUSS] v4 - One file commits

2025-05-29 Thread Ryan Blue
Hi everyone, Like Russell’s recent note, I’m starting a thread to connect those of us that are interested in the idea of changing Iceberg’s metadata in v4 so that in most cases committing a change only requires writing one additional metadata file. *Idea: One-file commits* The current Iceberg me

Re: [DISCUSS] Enabling more Meetups

2025-05-29 Thread Ryan Blue
I agree with Russell here. The goal is to clarify how to run a meetup that meets our requirements, rather than approving them individually. I like Max's addition to make anyone starting one aware of the brand guidelines. I also like Danica's suggestions so that we state that we expect meetups to g

Re: [DISCUSS] API: Rename RowDelta deleteFile to removeRows

2025-05-29 Thread Ryan Blue
+1 It is good to have consistency within the RowDelta API. And I think it is a good idea in general to use "remove" to refer to removing a file from metadata, rather than "delete" because you can add or remove delete files. On Thu, May 29, 2025 at 11:46 AM Russell Spitzer wrote: > Ryan pointed

Re: [Discuss] Make identity(String sourceName, String targetName) Public

2025-05-29 Thread Ryan Blue
Sorry, I didn't come back to this after I initially read it. I think it's fine to make this change because we can definitely have identity transform partition fields that don't match after a rename. If I remember correctly, the reason for not making this public was just to ensure partition field na

Re: [DISCUSS] Enabling more Meetups

2025-05-27 Thread Ryan Blue
JB, can you give us a bit more context about why you're recommending those pages? Do they have policies that already do what is being suggested? Do they impose limits that mean we could not do this? Without that information, I'm not sure what I'm looking for in those docs. On Sat, May 24, 2025 at

[RESULT] [VOTE] Adopt the v3 spec changes

2025-05-22 Thread Ryan Blue
12:56 AM Jean-Baptiste Onofré > wrote: > >> +1 (non binding) >> >> Regards >> JB >> >> On Mon, May 19, 2025 at 11:20 PM Ryan Blue wrote: >> > >> > Hi everyone, >> > >> > With the follow-ups from the earlier discussion th

Re: Core changes for Flink Dynamic Iceberg Sink

2025-05-20 Thread Ryan Blue
Max, Can you use the factory methods in Expressions rather than changing visibility? Also, I don’t think that making SchemaUpdate public is a good idea. It has a public interf

Re: Discuss proposal - IRC APIs for Multi-Statement Multi-Table Transactions

2025-05-20 Thread Ryan Blue
(optionally) expose this extra catalog information to clients and not need to change how loading works. Ryan On Tue, May 20, 2025 at 9:45 AM Ryan Blue wrote: > Hi everyone, > > To avoid passing copies of a file around for comments, I put the doc for > commit sequence numbers in

Re: Discuss proposal - IRC APIs for Multi-Statement Multi-Table Transactions

2025-05-20 Thread Ryan Blue
Hi everyone, To avoid passing copies of a file around for comments, I put the doc for commit sequence numbers into Google so we can comment on a central copy: https://docs.google.com/document/d/1jr4Ah8oceOmo6fwxG_0II4vKDUHUKScb/edit?usp=sharing&ouid=100239850723655533404&rtpof=true&sd=true Ryan

Re: [VOTE] Adopt the v3 spec changes

2025-05-19 Thread Ryan Blue
/writers for core object models for unknown, timestamp(9) types - Implemented default values and updated read paths - Reviewed table encryption PRs On Mon, May 19, 2025 at 3:20 PM Ryan Blue wrote: > Hi everyone, > > With the follow-ups from the earlier discussion thread wrapped up, I

[VOTE] Adopt the v3 spec changes

2025-05-19 Thread Ryan Blue
Hi everyone, With the follow-ups from the earlier discussion thread wrapped up, I’d like to raise a vote to adopt the v3 spec changes. *What is included?* - Default values for columns and f

Re: [VOTE] Release Apache Iceberg 1.9.1 RC0

2025-05-19 Thread Ryan Blue
I think we should address the problem that Aihua pointed out. Even if we can technically say that we are following the spec, this is a behavior change that is known to break with existing REST catalog services. I don't think that we should release a version that is known to break with existing serv

Re: Spark 4.0/Iceberg Integration Merged – Spark 3.5 Merges Can Resume

2025-05-15 Thread Ryan Blue
I agree, thank you for working on this! It's great to have this merged. On Thu, May 15, 2025 at 7:47 AM Russell Spitzer wrote: > Thanks for getting this in! > > On Wed, May 14, 2025 at 5:39 PM huaxin gao wrote: > >> Dear all, >> >> Thank you so much for your patience and support! >> >> The Spar

Re: [VOTE] File Format API

2025-05-15 Thread Ryan Blue
I definitely support introducing an API for this purpose and I think that the current work is the right direction. But I'm not sure that a vote is the right next step. A vote should be used to confirm consensus on a design and direction, and I thought the next steps were to build that consensus aro

Re: [VOTE] Clarify writer requirements in the spec to prevent orphan DVs

2025-05-14 Thread Ryan Blue
+1 (binding) Thanks, Anton! On Wed, May 14, 2025 at 9:41 AM Yufei Gu wrote: > +1 (binding) Thanks Anton! > Yufei > > > On Wed, May 14, 2025 at 9:36 AM Steven Wu wrote: > >> +1 (binding) >> >> On Wed, May 14, 2025 at 9:31 AM Akashdeep Gupta < >> gupta.akashde...@gmail.com> wrote: >> >>> +1 (non

Re: Should DDL operations always create new snapshots?

2025-05-12 Thread Ryan Blue
Snapshots are created when data changes and there is no change to the data tree at “time” v3. If you want to create new snapshots when the schema changes it is alright to do it, but I don’t think that we need to require it in the spec. Also, it isn’t clear to me why the time travel query would res

Re: [DISCUSS] [REST SPEC] Add first-row-id in the data files for Row Lineage

2025-05-12 Thread Ryan Blue
I thought for sure I had a PR that added this, but I can't find it. +1 to adding `first_row_id`. Thanks, Prashant! On Mon, May 12, 2025 at 9:22 AM Russell Spitzer wrote: > Makes sense to me, perhaps we should also add in a test that checks that > the Datafile api object and the REST spec are always

Re: [VOTE] Merge details about GZip metadata files to the spec.

2025-05-12 Thread Ryan Blue
+1 (binding) On Mon, May 12, 2025 at 10:50 AM Szehon Ho wrote: > +1 (binding) > > Thanks > Szehon > > On Mon, May 12, 2025 at 9:19 AM Russell Spitzer > wrote: > >> +1 (binding) >> >> On Mon, May 12, 2025 at 5:32 AM Eduard Tudenhöfner < >> etudenhoef...@apache.org> wrote: >> >>> +1 (binding) >>>

[RESULT] [VOTE] Add encryption key updates to REST spec

2025-05-12 Thread Ryan Blue
With 5 +1 votes and no -1 or +0 votes, this passes. Thanks, everyone! On Fri, May 9, 2025 at 3:00 PM Denny Lee wrote: > +1 (non-binding) > > On Fri, May 9, 2025 at 14:11 Ryan Blue wrote: > >> +1 (binding) >> >> On Thu, May 8, 2025 at 10:33 AM Russell Spitzer

Re: [VOTE] Add encryption key updates to REST spec

2025-05-09 Thread Ryan Blue
hoef...@apache.org> wrote: >> >>> +1 (binding) >>> >>> On Thu, May 8, 2025 at 5:23 PM Ryan Blue wrote: >>> >>>> Hi everyone, >>>> >>>> I’d like to raise a vote for committing PR 12987 >>>> <https://git

[VOTE] Add encryption key updates to REST spec

2025-05-08 Thread Ryan Blue
Hi everyone, I’d like to raise a vote for committing PR 12987 that adds table updates for encryption keys, AddEncryptionKey and RemoveEncryptionKey. These are needed to maintain the encryption key list in v3 metadata. Please vote in the next 72 hours

Re: [VOTE] Minor clarification for Geo Spec

2025-05-06 Thread Ryan Blue
+1 (binding) Thanks to Jia and Szehon for the quick turn-around getting this done! On Tue, May 6, 2025 at 2:37 PM Jia Yu wrote: > +1 (non-binding) > > Thanks for putting this together! > > Jia Yu > > On Tue, May 6, 2025 at 2:09 PM Szehon Ho wrote: > >> Hi everyone, >> >> As discussed briefly i

Re: [DISCUSS] Finalizing the v3 spec

2025-05-06 Thread Ryan Blue
ав. 2025 at 23:23, Jean-Baptiste Onofré wrote: >> >>> Hi Ryan >>> >>> All good for the spec. The idea for release is just a help to "double >>> check" the spec is good (we already saw some slight changes on the >>> spec while working on

[RESULT] [VOTE] Add encryption keys to table metadata

2025-05-05 Thread Ryan Blue
>>>> wrote: >>>> >>>>> +1 (non-binding) >>>>> >>>>> ~ Anurag Mantripragada >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>

Re: [DISCUSS] Finalizing the v3 spec

2025-05-01 Thread Ryan Blue
> >>>> Thanks, >>>> Jia >>>> >>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang >>>> wrote: >>>> >>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to >>>>> test impleme

Re: [DISCUSS] Spec update to cover compressed JSON metadata files

2025-05-01 Thread Ryan Blue
>> >> I think any changes to naming convention would have to be done as part of >> a new version of the spec (and file system based commits must be completely >> removed as of that version). >> >> I think ZSTD could be useful but that again is a strict improve

Re: [VOTE] Add encryption keys to table metadata

2025-04-30 Thread Ryan Blue
om> wrote: > >> +1 >> >> On Wed, Apr 30, 2025 at 11:36 AM Szehon Ho >> wrote: >> >>> +1 >>> >>> Thanks >>> Szehon >>> >>> On Wed, Apr 30, 2025 at 4:10 AM Eduard Tudenhöfner < >>> etudenhoef...@apac

[DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Ryan Blue
Hi everyone, I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage

[VOTE] Add encryption keys to table metadata

2025-04-29 Thread Ryan Blue
Hi everyone, I’d like to propose merging PR 12162 into the table spec for v3. The changes are a minimal set of additions needed to support table encryption schemes, including the scheme that we’re working on for table encryption with client-mana

Re: [Discuss] Streamlining Release Notes Preparation

2025-04-28 Thread Ryan Blue
I don't see much value in a release notes file. I think this kind of approach gets ignored and sets up a bad situation where release managers assume that notes are up to date when they aren't. That leads to poorer quality release notes. I think it is reasonable either to use the set of changes that

Re: Feathercast: 1.9.0 release

2025-04-28 Thread Ryan Blue
I'd be happy to chat about 1.9. A lot of implementation for v3 features went in, including: * timestamp(9), unknown, and variant types * Variant pushdown and metadata * Row lineage core changes Ryan On Mon, Apr 28, 2025 at 10:29 AM Ajantha Bhat wrote: > Hi, > I would like to see if any PMC

Re: [DISCUSS] Spec update to cover compressed JSON metadata files

2025-04-28 Thread Ryan Blue
It would be great to mention how to determine the compression of the metadata JSON file in the spec. Thanks for bringing this up. It makes sense to me to use the file name and get a bit more strict about this. That said, we will need to make sure that the current default behavior is documented and

[RESULT] [VOTE] Small spec change for default values

2025-04-24 Thread Ryan Blue
ut actual >> use cases. >> >> - Anton >> >> On Wed, Apr 23, 2025 at 10:27, Ryan Blue wrote: >> >>> +1 (binding) >>> >>> On Wed, Apr 23, 2025 at 8:39 AM Fokko Driesprong >>> wrote: >>> >>>> +1 (binding) >>>

Re: [VOTE] Make namespace separator configurable in REST Spec

2025-04-24 Thread Ryan Blue
While I agree that the configurable separator is the best solution that balances trade-offs, I don't think that we should move forward when there has been a veto from the community. In Iceberg and most ASF communities, votes are intended to confirm consensus --- not to make decisions. Since we don

Re: [VOTE] Small spec change for default values

2025-04-23 Thread Ryan Blue
> >>> +1 (non-binding) >>> >>> Best, >>> Prashant Singh >>> >>> On Tue, Apr 22, 2025 at 2:55 AM Eduard Tudenhöfner < >>> etudenhoef...@apache.org> wrote: >>> >>>> +1 >>>> >>>> On Tue,

[VOTE] Small spec change for default values

2025-04-21 Thread Ryan Blue
Hi everyone, I’d like to vote on the spec changes in PR 12841. This is a small change that makes handling default values for structs much easier. Initially, we allowed both a struct and its fields to have default values, but the values could conflict.

[RESULT] [VOTE] Update row lineage spec ID assignment

2025-04-19 Thread Ryan Blue
This passes with 11 +1 votes and no -1 or +0 votes. Thanks, everyone! On Fri, Apr 18, 2025 at 4:13 AM Fokko Driesprong wrote: > +1 > > On Fri, Apr 18, 2025 at 08:09, Jean-Baptiste Onofré wrote: > >> +1 (non binding) >> >> Regards >> JB >> >>

Re: [VOTE] Spec Update: Variant Field Lower/Upper Bounds

2025-04-18 Thread Ryan Blue
+1 (binding) On Thu, Apr 17, 2025 at 8:27 PM Aihua Xu wrote: > Hi all, > > I'd like to initiate a vote to include a spec update for supporting lower > and upper bounds on Variant fields. Summary of the change: > > The writer determines which fields to collect bounds for a Variant column. > Field

Re: [VOTE] Simplify multi-argument field-id(s) encoding

2025-04-17 Thread Ryan Blue
+1 (binding) On Thu, Apr 17, 2025 at 10:22 AM Szehon Ho wrote: > +1 (binding) > > Thanks > Szehon > > On Thu, Apr 17, 2025 at 10:18 AM Daniel Weeks wrote: > >> +1 (binding) >> >> On Thu, Apr 17, 2025 at 8:41 AM Russell Spitzer < >> russell.spit...@gmail.com> wrote: >> >>> +1 (Bind) >>> >>> On T

Re: [VOTE] Update row lineage spec ID assignment

2025-04-17 Thread Ryan Blue
Adding my own +1. On Thu, Apr 17, 2025 at 10:19 AM Daniel Weeks wrote: > +1 (binding) > > I think this update really helps ensure row ids will be present and > reliable for upgraded tables. Thanks Ryan! > > On Wed, Apr 16, 2025 at 4:09 PM Ryan Blue wrote: > >> Hi

[VOTE] Update row lineage spec ID assignment

2025-04-16 Thread Ryan Blue
Hi everyone, I’d like to start a vote to incorporate the spec changes in PR 12781. There are two main changes. First, the current language says that upgrading a table to v3 leaves all row IDs null and they are assigned when the rows are rewritten for

Re: [DISCUSS] Fix CVE-2025-30065 on 1.8.x / 1.7.x / 1.6.x?

2025-04-14 Thread Ryan Blue
I agree with Fokko. It's a good idea to get a release out soon that has a fix for this, but we don't want to make unnecessary releases for things that aren't actual vulnerabilities. That's especially true in older branches, where we have reasonable guidelines for what goes in them already. It's bet

Re: [VOTE] Row lineage required for v3

2025-04-05 Thread Ryan Blue
+1 On Mon, Mar 31, 2025 at 12:01 PM Anton Okolnychyi wrote: > +1 (binding) > > - Anton > > On Mon, Mar 31, 2025 at 11:43, Daniel Weeks wrote: > >> Hey Everyone, >> >> I'd like to raise the proposal to make row-lineage required >> by default to a vote.

Re: [VOTE] Minor simplifications for Geo Spec

2025-04-04 Thread Ryan Blue
+1 On Wed, Mar 19, 2025 at 1:01 PM Matt Topol wrote: > +1 (non-binding) > > On Wed, Mar 19, 2025, 2:02 PM Yufei Gu wrote: > >> +1 >> Yufei >> >> >> On Wed, Mar 19, 2025 at 10:42 AM Fokko Driesprong >> wrote: >> >>> +1 >>> >>> Kind regards, >>> Fokko >>> >>> On Wed, Mar 19, 2025 at 18:32, H

Re: [DISCUSS] Multi-arg transforms

2025-04-03 Thread Ryan Blue
Sorry I didn't see the discussion about adding a new bucket transform earlier. I think it's great to start talking about a new bucket transform, but we made sure that we could add new transforms without breaking forward-compatibility so that we didn't need to rush getting one in. I think that we're

Re: [DISCUSS] Row lineage required for v3

2025-03-25 Thread Ryan Blue
Okay, it sounds like we have consensus that it's a good idea to make row lineage required in v3 and that it's a good idea to signal to engines when they can write delete-and-insert changes. I think we need a bit more discussion on how to signal to engines, but in the meantime we can move forward wi

Re: [DISCUSS] Row lineage required for v3

2025-03-21 Thread Ryan Blue
rite > to tables where the property is true. This way the user could be confident > that the row lineage information is correct when the property is false. > > Thanks, Peter > > On Thu, Mar 20, 2025, 23:15 Ryan Blue wrote: > >> Now, if we make it required for V3 tables, what if user

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Ryan Blue
avior of the writer was >> without knowing what system wrote the data. >> >> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue wrote: >> >>> +1 for the PR and always having the lineage metadata. >>> >>> I think that is going to make the feature much more

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Ryan Blue
+1 for the PR and always having the lineage metadata. I think that is going to make the feature much more reliable. We don't gain anything from allowing the feature to be turned off for compatibility, when we have reasonable ways to interpret data written by any engine. Ryan On Wed, Mar 19, 2025

Re: [VOTE] Improve OpenAPI documentation around how NamespaceNotEmptyException is treated

2025-03-18 Thread Ryan Blue
to use 409 in order to indicate the >> NamespaceNotEmptyException >> >> On Mon, Mar 17, 2025 at 7:12 PM Christian Thiel < >> christian.t.b...@gmail.com> wrote: >> >>> +1 (non-binding) for the updated 409 Code >>> >>> On Fri, 14 Mar 2025 at 18:30, Ryan Blue wrote

Re: cleanExpiredMetadata in RemoveSnapshots

2025-03-15 Thread Ryan Blue
I don't think it is necessary to either make cleanup the default or to expose the flag in Spark or other engines. Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks tha

Re: [VOTE] Improve OpenAPI documentation around how NamespaceNotEmptyException is treated

2025-03-14 Thread Ryan Blue
From the issue, it looks like we're using 400 for this because that's what the Java client was returning as a generic or unhandled error. I don't think that's a good reason to standardize on 400 now that we are calling out this error in the spec. Why not choose an error code that distinguishes it

Re: [DISCUSS] Rename iceberg repo to iceberg-java ?

2025-03-14 Thread Ryan Blue
I agree. Unless there is a benefit or compelling reason, renames are generally not worth the unnecessary work. On Fri, Mar 14, 2025 at 1:13 PM Hussein Awala wrote: > I agree with Russell that the spec should remain language-agnostic. > However, for the Java client (and other integrations impleme

[DISCUSS] March board report

2025-03-11 Thread Ryan Blue
Hi everyone, It’s time for a board report again. Here’s my current draft. Thanks to Dan for helping put it together while I’m unexpectedly out of the office. Sorry if it is light on activity for some languages! Please let me know if there’s anything I can add and I’ll try to get it in tomorrow. Th

Re: [DISCUSS] Row Lineage Proposal

2024-08-28 Thread Ryan Blue
e actual spec > changes on a spec change PR. > > I'm going to be keeping the proposals for : > > Global Identifier as the Identifier > and > Last Updated Sequence number as the Version > > > > On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue > wrote: > >>

Re: [DISCUSS] Drop Hive 2 support

2024-08-27 Thread Ryan Blue
ive 2 support yet, I think the path is to >>> deprecate Hive 2 in 1.7 and drop Hive 2 in 1.8. What do you think? >>> >>> [1] https://hive.apache.org/general/downloads/ >>> [2] https://github.com/apache/iceberg/pull/10932 >>> [3] https://github.com/apache/iceberg/pull/10996 >>> >>> Regards, >>> Manu >>> >> -- Ryan Blue Databricks

Re: [VOTE] Merge REST Spec change to add RemovePartitionSpecsUpdate update type

2024-08-26 Thread Ryan Blue
mokfmg2f18934qnln > > Thanks, > > Amogh Jahagirdar > -- Ryan Blue Databricks

Re: Type promotion in v3

2024-08-22 Thread Ryan Blue
g. >>>>> >>>>> >>>>> I think the idea with Parquet files is one would no longer use a map >>>>> to track these statistics but instead have a column per >>>>> field-id/statistics >>>>> pair. So for each column in

Re: [VOTE] Spec changes in preparation for v3

2024-08-22 Thread Ryan Blue
n, Aug 19, 2024 at 1:34 PM Yufei Gu wrote: >>>> >>>>> +1 >>>>> Yufei >>>>> >>>>> >>>>> On Mon, Aug 19, 2024 at 1:17 PM Fokko Driesprong >>>>> wrote: >>>>> >>>>>

Re: [DISCUSS] Adding RemovePartitionSpecsUpdate update type to REST

2024-08-22 Thread Ryan Blue
nknown type must fail with a 400 response as >>>> discussed/voted earlier. [3] >>>> >>>> [1] https://github.com/apache/iceberg/pull/10755 >>>> <https://github.com/apache/iceberg/pull/10755> >>>> [2] https://github.com/apache/iceberg/pull/10846/ >>>> [3] https://lists.apache.org/thread/99lo7stnprchjzosjcq9k3mns1mq8fwc >>>> >>>> Thanks, >>>> >>>> Amogh Jahagirdar >>>> >>> -- Ryan Blue Databricks

Re: [VOTE] REST Endpoint discovery

2024-08-20 Thread Ryan Blue
+1 On Tue, Aug 20, 2024 at 1:46 PM Christian Thiel wrote: > + 1 (non-binding) > > -- Ryan Blue Databricks

Re: [VOTE] Spec changes in preparation for v3

2024-08-19 Thread Ryan Blue
> +1 - Feels duplicative to vote here and approve on the PR >> >> On Mon, Aug 19, 2024 at 2:41 PM Ryan Blue wrote: >> >>> Hi everyone, >>> >>> I'd like to vote on PR #10948 >>> <https://github.com/apache/iceberg/pull/10948>, which

Re: Type promotion in v3

2024-08-19 Thread Ryan Blue
s resolution of the metadata constant time > (technically it would be linear in the number of promotions), instead of > requiring parsing/keeping old schemas for metadata about only a few fields. > > Thanks, > Micah > > > > > On Fri, Aug 16, 2024 at 4:00 PM Ryan Blue wro

[VOTE] Spec changes in preparation for v3

2024-08-19 Thread Ryan Blue
rms * Reset heading levels that were set to de-clutter the TOC in previous site frameworks This will be open for at least 72 hours. [ ] +1 [ ] -0 [ ] -1 do not make these changes because . . . -- Ryan Blue

Re: [DISCUSS] Row Lineage Proposal

2024-08-19 Thread Ryan Blue
entifier as >>> well as a marker of what version of the row it is. This should let us build >>> a variety of features related to CDC, Incremental Processing and Audit >>> Logging. If you are interested please check out the linked proposal below. >>> This will require compliance from all engines to be really useful so It's >>> important we come to consensus on whether or not this is possible. >>> >>> >>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing >>> >>> >>> Thank you for your consideration, >>> Russ >>> >> -- Ryan Blue Databricks

Re: Type promotion in v3

2024-08-19 Thread Ryan Blue
etadata: we can >>> persist schema_id in the DataFile. It still adds some extra size to the >>> manifest file but should be negligible? >>> >>> And I think there’s also another aspect to consider: whether the new >>> type promotion is compatible with partitio

Type promotion in v3

2024-08-16 Thread Ryan Blue
file schema would include full type information for the stats columns. Given the complexity of releasing Parquet manifests, I think it makes more sense to get a few promotion cases done now in v3 and follow up with the rest in v4. Ryan -- Ryan Blue

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Ryan Blue
>>>>>>>>>> Thanks, >>>>>>>>>>>>> Manu >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry < >>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support! >>>>>>>>>>>>>> >>>>>>>>>>>>>> Given the differences between the supported types and the >>>>>>>>>>>>>> lack of interest from the other project, I think it is >>>>>>>>>>>>>> reasonable to >>>>>>>>>>>>>> duplicate the specification to our repository. >>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark >>>>>>>>>>>>>> spec as much as possible, to keep compatibility as much as >>>>>>>>>>>>>> possible. Maybe >>>>>>>>>>>>>> even revert to a shared specification if the situation changes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Peter >>>>>>>>>>>>>> >>>>>>>>>>>>>> Aihua Xu ezt írta (időpont: 2024. aug. >>>>>>>>>>>>>> 13., K, 19:52): >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks Russell for bringing this up. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is the main blocker to move forward with the Variant >>>>>>>>>>>>>>> support in Iceberg and hopefully we can have a consensus. To >>>>>>>>>>>>>>> me, I also >>>>>>>>>>>>>>> feel it makes more sense to move the spec into Iceberg rather >>>>>>>>>>>>>>> than Spark >>>>>>>>>>>>>>> engine owns it and we try to keep it compatible with Spark spec. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Aihua >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer < >>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Y’all, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, >>>>>>>>>>>>>>>> while we were hoping to move the Variant and Shredding >>>>>>>>>>>>>>>> specifications from >>>>>>>>>>>>>>>> Spark into Iceberg there doesn’t seem to be a lot of interest >>>>>>>>>>>>>>>> in that. 
>>>>>>>>>>>>>>>> Unfortunately, I think we have a number of issues with just >>>>>>>>>>>>>>>> linking to the >>>>>>>>>>>>>>>> Spark project directly from within Iceberg and *I believe >>>>>>>>>>>>>>>> we need to copy the specifications into our repository*. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There are a few reasons why i think this is necessary >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark >>>>>>>>>>>>>>>> Specification already includes types which Iceberg has no >>>>>>>>>>>>>>>> definition for (19, >>>>>>>>>>>>>>>> 20 >>>>>>>>>>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types> >>>>>>>>>>>>>>>> - Interval Types) and Iceberg already has a type which is not >>>>>>>>>>>>>>>> included >>>>>>>>>>>>>>>> within the Spark Specification (Time) and will soon have more >>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>> TimestampNS, and Geo. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Second, We would like to make sure that Spark is not a hard >>>>>>>>>>>>>>>> dependency for other engines. We are working with several >>>>>>>>>>>>>>>> implementers of >>>>>>>>>>>>>>>> the Iceberg spec and it has previously been agreed that it >>>>>>>>>>>>>>>> would be best if >>>>>>>>>>>>>>>> the source of truth for Variant existed in an engine and file >>>>>>>>>>>>>>>> format >>>>>>>>>>>>>>>> neutral location. The Iceberg project has a good open model of >>>>>>>>>>>>>>>> governance >>>>>>>>>>>>>>>> and, as we have seen so far discussing Variant >>>>>>>>>>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>, >>>>>>>>>>>>>>>> open and active collaboration. This would also help as we can >>>>>>>>>>>>>>>> strictly >>>>>>>>>>>>>>>> version our changes in-line with the rest of the Iceberg spec. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Third, The Shredding spec is not quite finished and >>>>>>>>>>>>>>>> requires some group analysis and discussion before we commit >>>>>>>>>>>>>>>> it. 
I think >>>>>>>>>>>>>>>> again the Iceberg community is probably the right place for >>>>>>>>>>>>>>>> this to happen >>>>>>>>>>>>>>>> as we have already started discussions here on these topics. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy >>>>>>>>>>>>>>>> of the existing specification from the Spark Project and move >>>>>>>>>>>>>>>> ahead with >>>>>>>>>>>>>>>> our discussions and modifications within Iceberg. That said, *I >>>>>>>>>>>>>>>> do not want to diverge if possible from the Spark proposal*. >>>>>>>>>>>>>>>> For example, although we do not use the Interval types above, >>>>>>>>>>>>>>>> I think we >>>>>>>>>>>>>>>> should not reuse those type ids within our spec. Iceberg's >>>>>>>>>>>>>>>> Variant Spec types 19 and 20 would remain unused along with >>>>>>>>>>>>>>>> any other types >>>>>>>>>>>>>>>> we think are not applicable. We should strive whenever >>>>>>>>>>>>>>>> possible to allow >>>>>>>>>>>>>>>> for compatibility. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am >>>>>>>>>>>>>>>> hoping to see if anyone in the community objects to this plan >>>>>>>>>>>>>>>> going forward >>>>>>>>>>>>>>>> or has a better alternative. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear >>>>>>>>>>>>>>>> back from everyone, >>>>>>>>>>>>>>>> Russ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- Ryan Blue Databricks

Re: [DISCUSS] adoption of format version 3

2024-08-15 Thread Ryan Blue
an unknown partition transform is _true_ because the partition field is ignored and not used in filtering." That also cleans up some of the boilerplate to work for v2 and v3. I also have an update on type promotion, but that's a longer issue so I'll start a new thread. On Wed, Aug 7,

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-08-15 Thread Ryan Blue
s to me that > it's never possible for contents of a variant value to contain a SQL NULL > value (only a JSON NULL), such as array(1, missing, 2). Since a variant > value is recursive, there doesn't appear to be any way to encode a SQL NULL > in the actual Variant value. > >> > > >> > If anyone has any insights that can confirm or reject my > understanding, I'd greatly appreciate it. I'm trying to become more > familiar with the Variant encoding and this seemed like it could be a > potential "gotcha" once column shredding is supported. > >> > > >> > Thanks, > >> > Nick Riasanovsky > -- Ryan Blue Databricks

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Ryan Blue
>>>>>>> >>>>>>> Personally speaking, I am pretty neutral on this topic, but curious >>>>>>> what everyone thinks. >>>>>>> >>>>>>> Best, >>>>>>> Jack Ye >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 14, 2024 at 9:20 AM Eduard Tudenhöfner < >>>>>>> etudenhoef...@apache.org> wrote: >>>>>>> >>>>>>>> Hey Dmitri, >>>>>>>> >>>>>>>> this proposal is the result of the community feedback from the >>>>>>>> Capabilities proposal. Ultimately the capabilities turned out to entail >>>>>>>> more complexity than necessary and so this proposal solves the core >>>>>>>> problem >>>>>>>> while keeping complexity and spec changes to an absolute minimum. >>>>>>>> >>>>>>>> Eduard >>>>>>>> >>>>>>>> On Wed, Aug 14, 2024 at 5:15 PM Dmitri Bourlatchkov >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Eduard, >>>>>>>>> >>>>>>>>> How is this proposal related to the Server Capabilities discussion? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Dmitri. >>>>>>>>> >>>>>>>>> On Wed, Aug 14, 2024 at 5:14 AM Eduard Tudenhöfner < >>>>>>>>> etudenhoef...@apache.org> wrote: >>>>>>>>> >>>>>>>>>> Hey everyone, >>>>>>>>>> >>>>>>>>>> I'd like to propose a way for REST servers to communicate to >>>>>>>>>> clients what endpoints it supports via a new *endpoints* field >>>>>>>>>> in the *CatalogConfig* of the *v1/config* endpoint. >>>>>>>>>> >>>>>>>>>> This enables clients to make better decisions and clearly signal >>>>>>>>>> that a particular endpoint isn’t supported. >>>>>>>>>> >>>>>>>>>> I opened #10937 <https://github.com/apache/iceberg/issues/10937> to >>>>>>>>>> track the proposal in GH. Please find the proposal doc here >>>>>>>>>> <https://docs.google.com/document/d/1krcIaLfxtBFDABU5ssLmf64zyHgE8BRncpGPIMTWlxo/edit?usp=sharing> >>>>>>>>>> (estimated >>>>>>>>>> read time: 5 minutes). The proposal requires a Spec change, which >>>>>>>>>> can be >>>>>>>>>> seen in #10928 <https://github.com/apache/iceberg/pull/10928>. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Eduard >>>>>>>>>> >>>>>>>>> -- Ryan Blue Databricks

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Ryan Blue
ted discussions here on these topics. > > For these reasons I think we should go with a direct copy of the existing > specification from the Spark Project and move ahead with our discussions > and modifications within Iceberg. That said, *I do not want to diverge if > possible from the Spark proposal*. For example, although we do not use > the Interval types above, I think we should *not* reuse those type ids > within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused > along with any other types we think are not applicable. We should strive > whenever possible to allow for compatibility. > > In the interest of moving forward with this proposal I am hoping to see if > anyone in the community objects to this plan going forward or has a better > alternative. > > As always I am thankful for your time and am eager to hear back from > everyone, > Russ > > Xuanwo > > https://xuanwo.io/ > > -- Ryan Blue Databricks

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Ryan Blue
>>>>>> Dmitri. >>>>>> >>>>>> On Wed, Aug 14, 2024 at 5:14 AM Eduard Tudenhöfner < >>>>>> etudenhoef...@apache.org> wrote: >>>>>> >>>>>>> Hey everyone, >>>>>>> >>>>>>> I'd like to propose a way for REST servers to communicate to clients >>>>>>> what endpoints it supports via a new *endpoints* field in the >>>>>>> *CatalogConfig* of the *v1/config* endpoint. >>>>>>> >>>>>>> This enables clients to make better decisions and clearly signal >>>>>>> that a particular endpoint isn’t supported. >>>>>>> >>>>>>> I opened #10937 <https://github.com/apache/iceberg/issues/10937> to >>>>>>> track the proposal in GH. Please find the proposal doc here >>>>>>> <https://docs.google.com/document/d/1krcIaLfxtBFDABU5ssLmf64zyHgE8BRncpGPIMTWlxo/edit?usp=sharing> >>>>>>> (estimated >>>>>>> read time: 5 minutes). The proposal requires a Spec change, which can be >>>>>>> seen in #10928 <https://github.com/apache/iceberg/pull/10928>. >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Eduard >>>>>>> >>>>>> -- Ryan Blue Databricks

Re: [DISCUSS] Cleanup svn dev/iceberg

2024-08-14 Thread Ryan Blue
iceberg-0.3.0.tar.gz.sha512 >> Aiceberg-dev/pyiceberg-0.2.1rc0/pyiceberg-0.2.1.tar.gz >> Aiceberg-dev/pyiceberg-0.2.0rc1/pyiceberg-0.2.0-py3-none-any.whl.asc >> Aiceberg-dev/pyiceberg-0.2.0rc1/pyiceberg-0.2.0.tar.gz.sha512 >> Aiceberg-dev/pyiceberg-0.2.0rc0/p

Re: [DISCUSS] Changing namespace separator in REST spec

2024-08-13 Thread Ryan Blue
t. >>> >>> >>> On Mon, Aug 5, 2024 at 7:07 PM Daniel Weeks wrote: >>> >>>> I would agree with adding either a server side (config override) or >>>> client side control (query param with `?delim=.`) as it will be >>>> compatible with the current v1 endpoint. >>>> >>>> In the future we could introduce a v2 endpoint(s), but I would want to >>>> wait for OpenAPI 4 because they address this by allowing multi-segment >>>> pathing via URI templates in RFC 6570 >>>> <https://datatracker.ietf.org/doc/html/rfc6570>, which is the original >>>> way we wanted to represent namespaces, but it wasn't supported (e.g. >>>> .../{+namespaces}/tables/{table}). I doubt it's really worth the >>>> effort though, so I feel like a configurable delimiter makes the most >>>> sense. >>>> >>>> -Dan >>>> >>> -- >> Robert Stupp >> @snazy >> >> -- Ryan Blue Databricks
