[ANNOUNCE] Apache Iceberg Go Release v0.3.0

2025-05-29 Thread Matt Topol
Hello everyone, I'm pleased to announce the release of Apache Iceberg Go v0.3.0! Apache Iceberg is an open table format for huge analytic datasets. Iceberg delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible t

Re: [DISCUSS] Enabling more Meetups

2025-05-29 Thread Ryan Blue
I agree with Russell here. The goal is to clarify how to run a meetup that meets our requirements, rather than approving them individually. I like Max's addition to make anyone starting one aware of the brand guidelines. I also like Danica's suggestions so that we state that we expect meetups to g

Re: [DISCUSS] Apache Iceberg 1.10.0 release

2025-05-29 Thread Steven Wu
> whether Spark 3.5 can perform some basic queries or provide file merging capabilities in the current or next version of V3? ConradJam, that has already worked for a while now. I don't think we have the exact Spark smoke test coverage you described. But I can see that Spark "testRemoveDanglin

Re: [Discuss] Make identity(String sourceName, String targetName) Public

2025-05-29 Thread Ryan Blue
Sorry, I didn't come back to this after I initially read it. I think it's fine to make this change because we can definitely have identity transform partition fields that don't match after a rename. If I remember correctly, the reason for not making this public was just to ensure partition field na

Re: [DISCUSS] API: Rename RowDelta deleteFile to removeRows

2025-05-29 Thread Ryan Blue
+1 It is good to have consistency within the RowDelta API. And I think it is a good idea in general to use "remove" to refer to removing a file from metadata, rather than "delete" because you can add or remove delete files. On Thu, May 29, 2025 at 11:46 AM Russell Spitzer wrote: > Ryan pointed

Re: [VOTE][Go] Release Apache Iceberg Go v0.3.0 RC0

2025-05-29 Thread Matt Topol
My vote is +1 (non-binding) Thank you everyone! I'll close the vote now as it passes with 4 non-binding +1 votes, 3 binding +1 votes, and no 0 or -1 votes. Binding votes: Fokko Eduard Amogh Non-binding votes: Kevin Leon JB Matt I'll start the release steps now. Thanks again! --Matt On Thu, M

Re: Wide tables in V4

2025-05-29 Thread Péter Váry
I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future. Meanwhile, I'm putting togethe

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-05-29 Thread Xiaoxuan Li
Hi Peter, thanks for sharing the context around the Flink streaming use case and side note for concurrent write. Apologies for the delay as I just got back from a vacation. Yeah, I agree, having the index at the partition level is a better approach if we plan to use caching. As a distributed cache

[DISCUSS] v4 - One file commits

2025-05-29 Thread Ryan Blue
Hi everyone, Like Russell’s recent note, I’m starting a thread to connect those of us that are interested in the idea of changing Iceberg’s metadata in v4 so that in most cases committing a change only requires writing one additional metadata file. *Idea: One-file commits* The current Iceberg me

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Amogh Jahagirdar
Thanks for kicking this thread off Ryan, I'm interested in helping out here! I've been working on a proposal in this area and it would be great to collaborate with different folks and exchange ideas here, since I think a lot of people are interested in solving this problem. Thanks, Amogh Jahagirda

[DISCUSS] API: Rename RowDelta deleteFile to removeRows

2025-05-29 Thread Russell Spitzer
Ryan pointed out to me that when I added the "deleteFile" method I was not following the convention already being used within the RowDelta operation and instead had copied the OverwriteFiles API. To fix this I think it would be great to change the API to "removeRows" to match the other APIs in the c

Re: [DISCUSS] Enabling more Meetups

2025-05-29 Thread Russell Spitzer
We definitely should not abdicate our responsibilities to the trademark; I just want to shift away from the pre-clearance model we have used so far. I know I always try to help folks out if I see something which I think may be inappropriate. On Thu, May 29, 2025 at 7:50 AM Rich Bowen wrote:

Re: [Discuss] Make identity(String sourceName, String targetName) Public

2025-05-29 Thread Russell Spitzer
I'm not seeing any strong feelings on this so I'm going to go ahead and merge. If anyone else sees issues we can always address this in a follow up. On Wed, May 21, 2025 at 6:07 PM Steven Wu wrote: > It seems that the PR has made two valid arguments to support to change of > public scope > * ide

Re: [DISCUSS] Enabling more Meetups

2025-05-29 Thread Rich Bowen
On 2025/05/23 19:24:28 Russell Spitzer wrote: > Hey Y'all > > Basically I would like to get the PMC out of the meetup approval business ... > Please let me know what you think, (Board hat) A critical role of a PMC is being stewards of the project's brands/marks. So while it's not necessary

Re: [VOTE] Release Apache Iceberg Rust 0.5.1 RC1

2025-05-29 Thread Fokko Driesprong
+1 (binding) Checked signatures, checksums, and licenses, and did some checks against the REST catalog. Thanks Kevin for running this release! Kind regards, Fokko On Wed, May 28, 2025 at 12:39, Renjie Liu wrote: > Hi: > > We still need two PMC votes for this release! > > Please help to test and

[DISCUSS] V4 - Parquet as Metadata File Format

2025-05-29 Thread Russell Spitzer
Hi Y'all As discussed in the last community sync, we are beginning to gather up folks who are interested in various efforts for Iceberg V4. To that end, I'd like to use this thread as a gathering point for folks interested in the metadata file format shift to Parquet. I wrote a quick abstract to d

Re: [VOTE] Adopt the v3 spec changes

2025-05-29 Thread Ajantha Bhat
I also noticed that some of the tables shared by v2 and v3 didn't mention v3. I've updated the headers to include v3 for clarity. Please let me know if this change requires a separate vote thread: https://github.com/apache/iceberg/pull/13181 - Ajantha On Wed, May 28, 2025 at 10:27 PM Ajantha Bhat

Re: [DISCUSS] Apache Iceberg 1.10.0 release

2025-05-29 Thread ConradJam
I would like to know whether Spark 3.5 can perform some basic queries or provide file merging capabilities in the current or next version of V3? On Thu, May 29, 2025 at 06:19, Steven Wu wrote: > JB, please set the milestone to the multi-arg transform PRs/issues. > > Peter/Max, ack on the changes targeted for 1

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Russell Spitzer
I’m also super excited about this idea On Thu, May 29, 2025 at 3:37 PM Amogh Jahagirdar <2am...@gmail.com> wrote: > Thanks for kicking this thread off Ryan, I'm interested in helping out > here! I've been working on a proposal in this area and it would be great to > collaborate with different fol

Re: Wide tables in V4

2025-05-29 Thread Bryan Keller
Fewer commit conflicts meaning the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column fa

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Steven Wu
This will be great for users. The metadata can self-adapt: start with a single compacted file, and as the table grows in size, the metadata can adapt to a tree or linked structure. On Thu, May 29, 2025 at 3:44 PM Russell Spitzer wrote: > I’m also super excited about this idea > > On Thu, May 29, 2025 at 3:

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Yufei Gu
> > BTW, does it make sense to take metadata json file into consideration as > well? Currently it is just a large json string containing all snapshots. > Since it is also on the critical path of a commit, I'm not sure if we can > explore incremental semantics on it together with manifest list files

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Péter Váry
Count me in! Do we plan to store these files in columnar format as well? On Fri, May 30, 2025, 04:00 Prashant Singh wrote: > I am also super excited about the idea ! I would love to contribute. > > On Thu, May 29, 2025 at 6:54 PM Yufei Gu wrote: > >> BTW, does it make sense to take metadata json

Re: Wide tables in V4

2025-05-29 Thread Gang Wu
IMO, the main drawback for the view solution is the complexity of maintaining consistency across tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc. On Fri, May 30, 2025 at 12:55 PM Bryan Keller wrote: > Fewer commit conflicts meaning the tables r

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Szehon Ho
Look forward to when Iceberg can move on a bit from its name, to handle slightly faster data. Interested as well to follow along, if I can ! Do we plan to store this files in columnar format as well? > Is that the other thread? https://lists.apache.org/thread/phdo75zmt8j9r44ngd7vdhtxqq63yxsp Tha

Re: [DISCUSS] Apache Iceberg 1.10.0 release

2025-05-29 Thread Prashant Singh
Thank you so much for driving this release! It will be really helpful in getting this critical table corruption bug fix out to Iceberg users: https://github.com/apache/iceberg/pull/12818 (Merged) Best, Prashant Singh On Thu, May 29, 2025 at 1:02 PM Steven Wu wrote: > > whether Spark 3.5 can p

Re: Wide tables in V4

2025-05-29 Thread Steven Wu
Bryan, interesting approach to split horizontally across multiple tables. A few potential downsides: * operational overhead. tables need to be managed consistently and probably in some coordinated way * complex read * maybe fragile to enforce correctness (during join). It is robust to enforce the

Re: Wide tables in V4

2025-05-29 Thread Bryan Keller
Hi everyone, We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Prashant Singh
I am also super excited about the idea! I would love to contribute. On Thu, May 29, 2025 at 6:54 PM Yufei Gu wrote: > BTW, does it make sense to take metadata json file into consideration as >> well? Currently it is just a large json string containing all snapshots. >> Since it is also on the c

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Gang Wu
This is a long-awaited discussion! BTW, does it make sense to take metadata json file into consideration as well? Currently it is just a large json string containing all snapshots. Since it is also on the critical path of a commit, I'm not sure if we can explore incremental semantics on it togethe

Re: [DISCUSS] V4 - Parquet as Metadata File Format

2025-05-29 Thread Ajantha Bhat
I am interested in working on this proposal. I would assume it is to use `InternalData` with the format as `parquet`. But the challenge will be the test cases, the core module cannot write the parquet metadata due to circular dependency. We need to abstract out the test cases in the core module and

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Ajantha Bhat
I am interested in these problems too. Looking forward to collaborating on this feature development. - Ajantha On Fri, May 30, 2025 at 7:07 AM Gang Wu wrote: > This is a long-awaited discussion! > > BTW, does it make sense to take metadata json file into consideration as > well? Currently it is