Re: [VOTE] Update the table statistics (puffin stats) spec

2025-07-28 Thread Péter Váry
+1 for long. Given that it is implemented as a long in every known implementation, we might not even want to handle the type difference in code. Eduard Tudenhöfner wrote (on Mon, Jul 28, 2025, 12:47): > I agree that this should have been a long in the spec, so +1 to fixing the > spec. I

Re: [DISCUSS] Iceberg Summit NA 2026 and Iceberg Summit EU 2026 ?

2025-07-28 Thread Péter Váry
I would love to participate in an Iceberg summit, and not have to fly long hours to get there. So +1 from me. Gábor Kaszab wrote (on Mon, Jul 28, 2025, 15:17): > Great to see there is interest for both NA and EU! > > +1 for both locations from me. > > Best Regards, > Gabor Kaszab > > Edu

Re: [DISCUSS] v4 - Improved column statistics

2025-07-28 Thread Péter Váry
If we focus strictly on file-level column statistics, then partition level column statistics is not a concern. However, looking ahead, we likely want to support column statistics at the table or partition level as well. It would be beneficial to adopt a consistent approach to ID generation and hand

Re: [DISCUSS] v4 - Improved column statistics

2025-07-23 Thread Péter Váry
paction. >>>> >>>> A quick example >>>> >>>> Say I am using a hierarchical sort order (A, B, C) >>>> I could then store the max, min of this transform which would be >>>> independent to individual column maxes >>>

Re: [ANNOUNCE] Welcome Prashant Singh as a new Apache Iceberg Committer

2025-07-23 Thread Péter Váry
Congratulations! Gábor Kaszab wrote (on Wed, Jul 23, 2025, 8:13): > Congrats Prashant! > > > Eduard Tudenhöfner wrote (on Wed, > Jul 23, 2025, 7:32): > >> Congrats Prashant, very well deserved! >> >> On Wed, Jul 23, 2025 at 5:43 AM Renjie Liu >> wrote: >> >>> Congrats, Pr

Re: [DISCUSS] FileFormat API proposal

2025-07-22 Thread Péter Váry
this is what you have suggested. Péter Váry wrote (on Mon, Jun 30, 2025, 18:42): > During the PR review [1], we began exploring what could we use as an > intermediate layer to reduce the need for engines and file formats to > implement the full matrix of file format - obj

Re: [DISCUSS] V4 - indexing support

2025-07-18 Thread Péter Váry
> *Primary Index*: Conventionally Primary Index - just means what the Table's Primary storage layout/organization was. Given that Iceberg supports Sort-order - if the Spec adds constraints to derive/influence Sort order based on the Identifier columns - it satisfies the Primary Index criteria. Her

Re: [DISCUSS] v4 - Improved column statistics

2025-07-17 Thread Péter Váry
> >>> >>> It turns out that actually producing the metric maps is a fairly >>> expensive operation, so being able to skip metrics more quickly even if the >>> bytes still have to be read is going to save time. That said, using a >>> columnar format i

Re: [DISCUSS] FileFormat API proposal

2025-06-30 Thread Péter Váry
object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4 Péter Váry wrote (on Wed, May 7, 2025, 13:15): > Hi everyone, > The proposed API part is reviewed and ready to go. See: > https://github.com/apache/ice

Re: Append-only table scans in the presence of OVERWRITE snapshots

2025-06-30 Thread Péter Váry
allowing them to choose what to do with the > data. Warnings are easily overlooked. > > > On Mon, Jun 30, 2025 at 11:38 AM Péter Váry > wrote: > >> Minimally LOG.warn message about deprecation. >> Maybe a "hidden" flag which could turn back to skip overwri

Re: Flink: Current and Future state of the sink connectors

2025-06-30 Thread Péter Váry
itch was > seamless. > > On Fri, Jun 27, 2025 at 5:48 PM Péter Váry > wrote: > >> Oh.. I missed that. That's a bit better, but I'm still not sure. Do we >> know of internal users who are already on the new Sink? What was the >> experience there? >> >

Re: Append-only table scans in the presence of OVERWRITE snapshots

2025-06-30 Thread Péter Váry
Right now, some users are discovering that they are > not reading all the data, which I believe is a much bigger surprise to them. > > On Fri, Jun 27, 2025 at 5:47 PM Péter Váry > wrote: > >> I would try to avoid breaking the current behaviour. >> Maybe after so

Re: Flink: Current and Future state of the sink connectors

2025-06-27 Thread Péter Váry
to switch the default > sink implementation earlier. Is that what you mean? > > -Max > > On Thu, Jun 26, 2025 at 5:19 PM Péter Váry > wrote: > >> +1 for the more conservative removal time >> >> Maximilian Michels wrote (on Thu, >> Jun 26, 2025, 15:5

Re: Append-only table scans in the presence of OVERWRITE snapshots

2025-06-27 Thread Péter Váry
weird to get an error on a DELETE >> snapshot if you already explicitly opted in for reading the appends of >> OVERWRITE snapshots. >> >> Users may not be able to control the type of snapshot to be created so >> this would otherwise render this feature useless. >>

Re: Append-only table scans in the presence of OVERWRITE snapshots

2025-06-26 Thread Péter Váry
> Consequently, we must throw on DELETE snapshots, even if users opt-in to reading appends of OVERWRITE snapshots. OVERWRITE snapshots themselves could still contain deletes. So in this regard, I don't see a difference between the DELETE and the OVERWRITE snapshots. Maximilian Michels wrote (

Re: Flink: Current and Future state of the sink connectors

2025-06-26 Thread Péter Váry
>> the problem. Will continue on this, but there’s a possibility that this >>> won’t be ready in time for this to be considered to be part of 1.10 release. >>> >>> >>> Thanks, >>> >>> Rodrigo >>> >>> >>> [1] https://

Re: Flink: Current and Future state of the sink connectors

2025-06-23 Thread Péter Váry
+1 On Mon, Jun 23, 2025, 22:04 Őrhidi Mátyás wrote: > +1, sounds reasonable to me > > Thanks, > Matyas > > On Mon, Jun 23, 2025 at 11:28 AM Rodrigo Meneses > wrote: > >> Hi devs, >> >> >> I’d like to start a discussion about the current and future state of our >> Flink Sink Connectors. >> >> >>

Re: Iceberg 0.10.0 release update - June 18, 2025

2025-06-18 Thread Péter Váry
If possible, I would love to have the File Format API interfaces approved and merged: https://github.com/apache/iceberg/pull/12774 The effort has been ongoing for half a year now, and not many changes have been requested lately. On Thu, Jun 19, 2025, 00:16 Steven Wu wrote: > sorry, I meant 1.10.0 release. Thanks

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-13 Thread Péter Váry
s thread: >>>>> >>>>> 1. Is this an Iceberg issue, or a Parquet (table format provider) >>>>> issue? For example, if Parquet (or other table format provider) provides a >>>>> mechanism where both query by position and query by equality

Re: Wide tables in V4

2025-06-13 Thread Péter Váry
On Fri, Jun 6, 2025 at 10:39 AM Jean-Baptiste Onofré > wrote: > >> Hi Peter >> >> Thanks for your message. It's an interesting topic. >> >> Would it not be more a data file/parquet "issue" ? Especially with the >> data file API you are proposing

Re: Wide tables in V4

2025-06-05 Thread Péter Váry
For the record, link from a user requesting this feature: https://github.com/apache/iceberg/issues/11634 On Mon, Jun 2, 2025, 12:34 Péter Váry wrote: > Hi Bart, > > Thanks for your answer! > I’ve pulled out some text from your thorough and well-organized response > to ma

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-04 Thread Péter Váry
t; 0.01% of records from a 2-billion-row dataset as the experimental baseline. > But yes, when updating many random keys, it's likely to touch nearly all > buckets, which increases the number of index files that must be scanned. > This is exactly why the lookup performance of the index

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-03 Thread Péter Váry
lly rewriting indexes to match query patterns might not > be a good idea, as it can be very costly for large datasets. Also, the > index file format will play a crucial role in key lookup performance. For > example, Hudi uses HFile for indexing, which supports fast point lookups > thanks to its

Re: [DISCUSS] v4 - Improved column statistics

2025-06-03 Thread Péter Váry
I would love to see more flexibility in file stats. Together with the change that allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata that could be used for filtering out files, HLL sketches, etc. +1 for the change. On Tue, Jun 3, 2025, 0

Re: Wide tables in V4

2025-06-02 Thread Péter Váry
formance across many columns/queries - not just a select few. Additionally, this approach enables frequently requested features like adding or updating column families without rewriting the entire table. Thanks, Peter Bart Samwel wrote (on Mon, Jun 2, 2025, 10:21): > On Fri, May 30

Re: [VOTE] File Format API

2025-06-02 Thread Péter Váry
. Thanks, Peter Péter Váry wrote (on Fri, May 16, 2025, 10:51): > Thanks Ryan for your support! I know when you have time, you will check > this proposal as well. > > I understand that V3 is important. I’ve been following the votes and PRs > and can see the great progress bei

Re: Wide tables in V4

2025-05-30 Thread Péter Váry
ow us to better optimize for queries where only a small number of columns are projected from a wide table. Bart Samwel wrote (on Fri, May 30, 2025, 16:03): > > > On Fri, May 30, 2025 at 3:33 PM Péter Váry > wrote: > >> One key advantage of introducing Physical Files

Re: Wide tables in V4

2025-05-30 Thread Péter Váry
e added >>> benefits like splitting up the metadata, fewer commit conflicts, and >>> ability to share, nest, and swap "column families". The downsides are table >>> management is split across multiple tables, it requires engine support of >>> shuffle-less join

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-05-30 Thread Péter Váry
taleness> >> (Option 3) as another potential approach, and for inspiration. >> In this setup, The engine is running periodic merge jobs and applying >> equality deletes to the actual table, based on PK. and for some cases >> applying it during runtime. >> >>

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Péter Váry
Count me in! Do we plan to store these files in columnar format as well? On Fri, May 30, 2025, 04:00 Prashant Singh wrote: > I am also super excited about the idea ! I would love to contribute. > > On Thu, May 29, 2025 at 6:54 PM Yufei Gu wrote: > >> BTW, does it make sense to take metadata json

Re: Wide tables in V4

2025-05-29 Thread Péter Váry
cause RowGroup size imbalances (placing these in separate files could significantly improve performance) Thanks, Peter Péter Váry wrote (on Wed, May 28, 2025, 15:39): > I would be happy to put together a proposal based on the inputs got here. > > Thanks everyone for

Re: Wide tables in V4

2025-05-28 Thread Péter Váry
- Apache Hudi has the concept of column family [2]. >>>>>>>> - Apache Paimon supports sequence groups [3] for partial update. >>>>>>>> >>>>>>>> Although Parquet can introduce the concept of logical file and >>>>>>>

Re: [DISCUSS] Apache Iceberg 1.10.0 release

2025-05-27 Thread Péter Váry
Could we at least get the new File Format API in? Not the actual implementation, just the API itself. I would love to move forward with it, but I still need reviews from fellow community members. Thanks, Peter On Tue, May 27, 2025, 20:26 Russell Spitzer wrote: > Thanks Steven! I know we are

Re: [VOTE] Release Apache Iceberg 1.9.1 RC1

2025-05-27 Thread Péter Váry
+1 (binding). Tested signatures, built, ran some tests. Ajantha Bhat wrote (on Tue, May 27, 2025, 14:40): > +1 (non-binding) > > * validated checksum and signature > * checked license docs & ran RAT checks > * ran build and tests > > - Ajantha > > > On Tue, May 27, 2025 at 7:19 AM Yuya Eb

Re: Wide tables in V4

2025-05-26 Thread Péter Váry
>>> any answer fo this yet, but I would be really interested in exploring this >>> further. >>> >>> Best Regards, >>> Yun >>> >>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang >>> wrote: >>> >>>> Hi Peter, I

Wide tables in V4

2025-05-26 Thread Péter Váry
Hi Team, In machine learning use-cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bo

Re: Discuss proposal - IRC APIs for Multi-Statement Multi-Table Transactions

2025-05-26 Thread Péter Váry
I'm interested but can't be there, so please record the meeting. Thanks, Peter Maninderjit Singh wrote (on Sat, May 24, 2025, 2:30): > Hi dev community, > I was wondering if we could join a call next week for discussing the > multi-table transactions so we can make progress. I have

Re: [VOTE] [REST SPEC] Add row lineage fields.

2025-05-23 Thread Péter Váry
+1 Eduard Tudenhöfner wrote (on Fri, May 23, 2025, 9:27): > +1 (binding) > > On Fri, May 23, 2025 at 1:19 AM Steven Wu wrote: > >> +1 (binding) >> >> On Thu, May 22, 2025 at 3:39 PM Prashant Singh >> wrote: >> >>> Hi All, >>> I propose an update to the Rest Spec to include the Row lin

Re: [VOTE] Adopt the v3 spec changes

2025-05-20 Thread Péter Váry
+1 (binding) Well done, everyone who was working on this! Fokko Driesprong wrote (on Tue, May 20, 2025, 8:49): > +1 (binding) > > Huge milestone, thanks for driving this, Ryan! > > Kind regards, > Fokko > > On Tue, May 20, 2025 at 08:45, huaxin gao wrote: > >> +1 (non-binding) >> >> On Mo

Re: [VOTE] Release Apache Iceberg 1.9.1 RC0

2025-05-20 Thread Péter Váry
nts are trying to express deletions with >>>>>> larger sizes and the server is unable to handle it due to the assertion >>>>>> in >>>>>> the older implementation, not because the protocol changed. Though I can >>>>>> see the

Re: Hive 4 support

2025-05-19 Thread Péter Váry
ass from another version of Hive). Even with `DynConstructors` you have >>>> to give it the instance of the class. Again, if you have suggestions to >>>> overcome this, we can discuss them in the PR. However, I feel that keeping >>>> a single version of test code t

Re: [VOTE] Release Apache Iceberg 1.9.1 RC0

2025-05-19 Thread Péter Váry
+1 (binding) Verified signature, built, and ran some tests. Maximilian Michels wrote (on Mon, May 19, 2025, 11:17): > +1 (non-binding) > > 1. Verified the archive checksum and signature > 2. Extracted and inspected the source code for binaries > 3. Compiled and tested the source code > 4

Re: [VOTE] File Format API

2025-05-16 Thread Péter Váry
> step to get all reviewers to spend time on this area rather than to try to > decide with a vote? I'll make sure I take the time in the next few weeks to > help with the reviews. I think that building consensus is the right next > step. > > Ryan > > On Thu, May 15, 2025 a

[VOTE] File Format API

2025-05-15 Thread Péter Váry
Hi Team, We started the discussion of the File Format API proposal [1] a long time ago [2]. Since then - during the review process - we moved from single formalization of the similar APIs to bigger changes. The lucky ones could see a presentation about the results during the Iceberg Summit [3]. Th

Re: [VOTE] Clarify writer requirements in the spec to prevent orphan DVs

2025-05-14 Thread Péter Váry
+1 (binding) Denny Lee wrote (on Wed, May 14, 2025, 20:29): > +1 (non-binding) > > On Wed, May 14, 2025 at 11:21 AM Szehon Ho > wrote: > >> +1 (binding) >> >> Thanks >> Szehon >> >> On Wed, May 14, 2025 at 10:15 AM Fokko Driesprong >> wrote: >> >>> +1 (binding) >>> >>> On Wed, May 14

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-05-13 Thread Péter Váry
025 at 4:34 PM Steven Wu wrote: > >> agree with Peter that 1:1 mapping of data files and inverted indexes are >> not as useful. With columnar format like Parquet, this can also be achieved >> equivalently by reading the data file with projection on the identifier >> columns.

Re: [Discuss] Iceberg 1.9.1 Release

2025-05-13 Thread Péter Váry
Do we really want to include a new lib version in a maintenance release? In the past, we have seen issues when upgrading libs. Avro is very important, as it is used for metadata files. I would rather not include a new version, unless it is absolutely necessary. On Tue, May 13, 2025, 06:42 Jean-Bap

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-05-12 Thread Péter Váry
ernatively, using an external > key-value store for fast lookups could also be explored. Would be great to > hear others’ thoughts on this. > > > Thanks, > > Xiaoxuan > > On Fri, May 9, 2025 at 8:12 AM Péter Váry > wrote: > >> When going through the optio

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-05-09 Thread Péter Váry
When going through the options mentioned by Anton, I feel that Options 1 and 4 are just pushing the responsibility of converting the equality deletes to positional deletes to the engine side. The only difference is whether the conversion happens on the write side or on the read side. This is a step

Re: [DISCUSS] FileFormat API proposal

2025-05-07 Thread Péter Váry
, so I will not merge the PR this week. Please review it if you. Thanks, Peter Péter Váry wrote (on Wed, Apr 16, 2025, 7:02): > Hi Renjie, > The first one for the proposed new API is here: > https://github.com/apache/iceberg/pull/12774 > Thanks, Peter > > On Wed, A

Re: [DISCUSS] FileFormat API proposal

2025-04-15 Thread Péter Váry
> > I'm quite interested in this topic, and please ping me in those splitted > prs and I'll help to review. > > On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré > wrote: > >> Hi Peter >> >> Awesome ! Thank you so much ! >> I will do a new pass.

Re: [VOTE] Release Apache Iceberg 1.9.0 RC0

2025-04-15 Thread Péter Váry
+1. Checked the signatures, compiled the code, ran some tests. Aihua Xu wrote (on Mon, Apr 14, 2025, 0:54): > +1 (non-binding). > > Verified against Snowflake build. > > On Sun, Apr 13, 2025 at 12:20 AM Yuya Ebihara < > yuya.ebih...@starburstdata.com> wrote: > >> +1 (non-binding) >> >> Tr

Re: [DISCUSS] FileFormat API proposal

2025-04-11 Thread Péter Váry
le providers (Parquet, Avro, ORC) > > Thoughts ? I can help on the split if needed. > > Regards > JB > > On Thu, Apr 10, 2025 at 5:16 AM Péter Váry > wrote: > > > > Since the 1.9.0 release candidate has been created, I would like to > resurrect this PR: https:

Re: [DISCUSS] FileFormat API proposal

2025-04-10 Thread Péter Váry
original change was not small, and grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward. Thanks, Peter Péter Váry wrote (on Wed, Mar 26, 2025,

Re: [DISCUSS] FileFormat API proposal

2025-04-05 Thread Péter Váry
(position > deletes, etc) to further prune data. > > But anyway, I agree that we could postpone it in follow up pr. > > 2. *Batch size configuration* > > I'm leaning toward option 2. > > 3. *Spark configuration* > > I'm leaning towards using differe

Re: Optimize Equality Deletes with Sorting

2025-04-05 Thread Péter Váry
Hi Edgar, Thanks for the well described proposal! Knowing the Flink connector, I have the following concerns: - Flink connector currently doesn't sort the rows in the data files. It "chickens" out of this to avoid keeping anything in memory. - Sorting the equality delete rows would also add memor

Re: [DISCUSS] FileFormat API proposal

2025-04-05 Thread Péter Váry
users. For this I have updated the deprecation comments accordingly. I would like to ask you to review the PR, so we can iron out any possible requested changes and be ready for the merge as soon as possible after the 1.9.0 release. Thanks, Peter Péter Váry wrote (on Fri, Mar 21, 2025, 14:32

Re: Hive 4 support

2025-04-05 Thread Péter Váry
Hi Wing Yew, Thanks for taking a look at this. After the removal of the Hive runtime code, we only depend on HMS in the HiveCatalog module in the production code. Since HMS API is supposed to be backward compatible, I would prefer to keep a single hive-metastore module and a single source dir for

Re: [VOTE] Row lineage required for v3

2025-04-01 Thread Péter Váry
+1 On Tue, Apr 1, 2025, 17:22 Steve Zhang wrote: > +1 (non binding) > > Thanks, > Steve Zhang > > > > On Mar 31, 2025, at 11:05 PM, Jean-Baptiste Onofré > wrote: > > +1 (non binding) > > >

Re: cleanExpiredMetadata in RemoveSnapshots

2025-03-26 Thread Péter Váry
I know of several companies who are using either scheduled stored procedures or the existing actions to maintain production tables. I don't think we should deprecate them until there is a viable open solution for them. Manu Zhang wrote (on Wed, Mar 19, 2025, 17:52): > I think a catal

Re: [DISCUSS] Row lineage required for v3

2025-03-24 Thread Péter Váry
> Would this property cause streaming writes using equality deletes to fail until the table is updated? I’m open to this solution since I think people should definitely be aware of the trade-offs they’re making in their tables. I don't think we can do such a check on the Iceberg side. As discussed

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Péter Váry
I agree with most of what Ryan said, with a single important exception: > If an engine doesn’t implement row ID preservation (which, by the way, is not hard!) [..] For streaming applications (Flink, Kafka Connect) this is a non-trivial task. The engine either has to keep everything in memory or do co

Re: [DISCUSS] FileFormat API proposal

2025-03-20 Thread Péter Váry
the community decides on. Thanks, Peter Jean-Baptiste Onofré wrote (on Fri, Mar 14, 2025, 16:31): > Hi Peter > > Thanks for the update. I will do a new pass on the PR. > > Regards > JB > > On Thu, Mar 13, 2025 at 1:16 PM Péter Váry > wrote: > >

Re: cleanExpiredMetadata in RemoveSnapshots

2025-03-15 Thread Péter Váry
I would be hesitant to turn on any new feature by default, especially for Spark compaction, which is widely used in production. +1 for providing a way for users to enable the feature manually. Gabor Kaszab wrote (on Fri, Mar 14, 2025, 12:19): > Hi Iceberg Community, > > There were r

Re: [DISCUSS] FileFormat API proposal

2025-03-13 Thread Péter Váry
make the review easier, please let me know. Thanks, Peter Péter Váry wrote (on Fri, Feb 28, 2025, 17:50): > Hi everyone, > Thanks for all of the actionable, relevant feedback on the PR ( > https://github.com/apache/iceberg/pull/12298). > Updated the code to address most of

Re: [Flink] Remove FlinkSink for Flink 2.0

2025-03-12 Thread Péter Váry
I agree with the general sentiment, that we should remove FlinkSource, but only deprecate FlinkSink. I understand Steven's approach to remove the deprecated classes only from the latest Flink version, but keep them for the old versions, but I have some concerns with this. My main concern is that

Re: [VOTE] Release Apache Iceberg 1.7.2 RC2

2025-03-12 Thread Péter Váry
+1. Checked the signatures and ran some tests. Fokko Driesprong wrote (on Wed, Mar 12, 2025, 12:21): > +1 (binding) > > I've updated the PR , > but then forgot about it. > > Kind regards, > Fokko > > On Wed, Mar 12, 2025 at 11:35,

Re: [DISCUSS] FileFormat API proposal

2025-02-28 Thread Péter Váry
ing and well written. > I like the DataFile API and definitely worth to discuss all together. > > Maybe we can schedule a specific meeting to discuss about DataFile API ? > > Thoughts ? > > Regards > JB > > On Tue, Feb 11, 2025 at 5:46 PM Péter Váry > wrote: > >

Re: [VOTE] Release Apache Iceberg 1.8.1 RC1

2025-02-26 Thread Péter Váry
+1. Checked the signatures and checksums, built, and ran some tests. Amogh Jahagirdar <2am...@gmail.com> wrote (on Wed, Feb 26, 2025, 6:11): > +1 (binding) > > Verified signatures, checksum, RAT checks. > Ran build and test with JDK17 > > Thanks, > Amogh Jahagirdar > > On Wed, Feb 26, 2025

Re: [VOTE] Java implementation notes around current-snapshot-id

2025-02-24 Thread Péter Váry
+1 On Tue, Feb 25, 2025, 04:16 Steve Zhang wrote: > +1 (nb) > Thanks, > Steve Zhang > > > > On Feb 24, 2025, at 6:32 PM, Renjie Liu wrote: > > +1 > > On Tue, Feb 25, 2025 at 7:00 AM Szehon Ho wrote: > >> +1 >> >> Thanks >> Szehon >> >> On Mon, Feb 24, 2025 at 2:52 PM rdb...@gmail.com >> wrote

Re: [VOTE] Allow Row-Lineage with Equality Deletes

2025-02-20 Thread Péter Váry
+1 Manu Zhang wrote (on Thu, Feb 20, 2025, 8:06): > +1 (non-binding) > > Regards > Manu > > On Thu, Feb 20, 2025 at 2:57 PM Jean-Baptiste Onofré > wrote: > >> +1 >> >> Regards >> JB >> >> On Wed, Feb 19, 2025 at 11:13 PM Russell Spitzer >> wrote: >> > >> > The PR: https://github.com

Re: [DISCUSS] Add metadata stats/metrics management on the REST Spec

2025-02-19 Thread Péter Váry
@Fokko: your point is absolutely valid. We don't want to burden the active catalog instance with returning such a big data set. Otherwise the main responsibility of the catalog could suffer. OTOH there is some info which exists only on the catalog side which is not available elsewhere. This is esp

Re: FileRewrite API refactor

2025-02-18 Thread Péter Váry
> My other point is that those context/plan info may not need to be > duplicated. E.g., currently the "RewriteExecutionContext" look duplicated > in "RewriteDataFileSparkAction" and "RewritePositionDeleteFilesSparkAction" > classes > > On Fri, Feb 7,

Re: [DISCUSS] FileFormat API proposal

2025-02-18 Thread Péter Váry
- https://github.com/apache/iceberg/pull/12298/commits/313c2d59b04db390be09172356d3f5359e6f6d6e - Flink reader/writer implementation with the new interfaces Péter Váry wrote (on Tue, Feb 18, 2025, 10:08): > Hi Renjie, > > Based on your feedback, I have created a

Re: [DISCUSS] FileFormat API proposal

2025-02-18 Thread Péter Váry
the new interfaces Thanks, Peter Péter Váry wrote (on Fri, Feb 14, 2025, 11:30): > Hi Renjie, > Here is the WIP PR for the readers: > https://github.com/apache/iceberg/pull/12069 > Here is the WIP PR for the writers: > https://github.com/apache/iceberg/pull/12164 >

Re: [DISCUSS] FileFormat API proposal

2025-02-14 Thread Péter Váry
eems too abstract to understand, do > you mind to submit a pr so that it would be more clear what's changed? > > On Wed, Feb 12, 2025 at 12:46 AM Péter Váry > wrote: > >> Hi Team, >> >> As mentioned earlier on our Community Sync I am exploring the >> pos

Re: [VOTE] Release Apache Iceberg 1.8.0 RC0

2025-02-13 Thread Péter Váry
A late +1 - I just got to checking the signatures, checksum, finished building and running some tests. Amogh Jahagirdar <2am...@gmail.com> wrote (on Thu, Feb 13, 2025, 7:43): > Thanks everyone who participated in the vote for Release Apache Iceberg > 1.8.0 RC0. > > The vote result is:

Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

2025-02-12 Thread Péter Váry
ed, Feb 12, 2025, 19:06 Russell Spitzer wrote: > I'm not sure I follow how one could figure out the equality delete row ID > after the fact. Won't I need to use some other primary key identifier and > do a shuffle join to line it up > with existing records? > > On W

Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

2025-02-12 Thread Péter Váry
t blindly claim deletion > of any previous rows ( *if there are any*) marked by the equality key. > > > On Tue, Feb 11, 2025 at 10:11 PM Péter Váry > wrote: > >> Hi Russell, >> >> Thanks for bringing this up! >> I think equality deletes are not the root o

Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

2025-02-11 Thread Péter Váry
Hi Russell, Thanks for bringing this up! I think equality deletes are not the root of the problem here. - If we have a positional delete, and the new row doesn't include the old rowId, then the lineage info is lost. - If we have an equality delete, and the new row contains the rowId, then we have

[DISCUSS] FileFormat API proposal

2025-02-11 Thread Péter Váry
Hi Team, As mentioned earlier on our Community Sync I am exploring the possibility to define a FileFormat API for accessing different file formats. I have put together a proposal based on my findings. --- Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. Wit

Re: Proposal: Parquet footer size in Iceberg metadata

2025-02-09 Thread Péter Váry
Do we want to create data-file-format-specific maps for this metadata, to better separate the different formats? I don't think having the Parquet footer size is relevant for Avro or ORC. On Mon, Feb 10, 2025, 06:52 Xuanwo wrote: > +1 non-binding from me. > > I love this idea. Even though S3 suppo

Re: [VOTE] Simplify multi-arg table metadata

2025-02-09 Thread Péter Váry
+1 On Mon, Feb 10, 2025, 03:44 Manu Zhang wrote: > +1 (non-binding) > > On Mon, Feb 10, 2025 at 10:25 AM roryqi wrote: > >> +1 >> >> xianjin wrote on Mon, Feb 10, 2025 at 10:02: >> >>> +1 (non-binding) >>> >>> On Mon, Feb 10, 2025 at 2:03 AM Hussein Awala wrote: >>> +1 (non-binding) On Sun,

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-07 Thread Péter Váry
+1 On Fri, Feb 7, 2025, 21:20 Kevin Liu wrote: > +1 (non-binding) > It's great to see support for more data types in both parquet and Iceberg! > > Best, > Kevin Liu > > On Fri, Feb 7, 2025 at 12:11 PM huaxin gao wrote: > >> +1 (non-binding) >> >> On Fri, Feb 7, 2025 at 12:03 PM Honah J. wrote:

Re: FileRewrite API refactor

2025-02-07 Thread Péter Váry
; in the core module. It could be just a super > set of all needed fields. > > > On Wed, Feb 5, 2025 at 2:31 AM Péter Váry > wrote: > >> Hi Steven, >> >> Thanks for chiming in! >> >> The decision points I have collected: >> >>- Da

Re: Welcome Huaxin Gao as a committer!

2025-02-06 Thread Péter Váry
Congratulations! Matt Topol wrote (on Thu, Feb 6, 2025, 10:40): > Congrats! Welcome! > > On Thu, Feb 6, 2025, 10:19 AM Raúl Cumplido wrote: > >> Congrats Huaxin! >> >> On Thu, Feb 6, 2025 at 10:16, Gang Wu () wrote: >> >>> Congrats Huaxin! >>> >>> Best, >>> Gang >>> >>> On Thu,

Re: FileRewrite API refactor

2025-02-05 Thread Péter Váry
list of files to be compacted) and > `PlanContext` (common metadata) > > > On Sat, Feb 1, 2025 at 3:06 AM Russell Spitzer > wrote: > >> We probably still have to support it as long as we have V2 Table support >> right? >> >> On Fri, Jan 31, 2025 at 9:13 AM

Re: guideline for interface change

2025-01-31 Thread Péter Váry
Can we deprecate the old method, and provide a default implementation for the new method using the old one? This would keep the old functionality until the deprecated method is removed. On Sat, Feb 1, 2025, 02:01 Aihua Xu wrote: > Hi folks, > > What is the general guideline for interface change

Re: FileRewrite API refactor

2025-01-31 Thread Péter Váry
o take a > RewriteGroup and generate new files, I think this should be independent of > the planner below > 3) Planner: A Non-Engine specific class which knows how to generate > RewriteGroups given a set of parameters > > On Tue, Jan 14, 2025 at 7:08 AM Péter Váry > wrote: >

Re: [DISCUSS/VOTE] Add in ChangeLog Reserved Field IDs to Spec and Decrement Row Lineage Reserved IDs

2025-01-25 Thread Péter Váry
+1 Thanks for taking care of this! On Fri, Jan 24, 2025, 23:20 Yufei Gu wrote: > Thanks for fixing this, Russell! > > +1 for keeping the changelog view related id as is, given the changelog > view has been widely used. > > Yufei > > > On Fri, Jan 24, 2025 at 12:35 PM Russell Spitzer < > russell.

Re: [VOTE] Document Snapshot Summary Optional Fields as Subsection of Appendix F in Spec

2025-01-21 Thread Péter Váry
+1 On Wed, Jan 22, 2025, 06:06 huaxin gao wrote: > +1 (non-binding) > > On Tue, Jan 21, 2025 at 6:04 PM Manu Zhang > wrote: > >> +1 (non-binding) >> >> Thanks & Regards >> >> On Wed, Jan 22, 2025 at 8:06 AM Daniel Weeks wrote: >> >>> +1 (binding) >>> >>> On Tue, Jan 21, 2025 at 1:05 PM Szehon

Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-15 Thread Péter Váry
+1 Steven Wu wrote (on Thu, Jan 16, 2025, 0:46): > +1 > > On Wed, Jan 15, 2025 at 9:00 AM Russell Spitzer > wrote: > >> Hi Everyone! >> >> PR: https://github.com/apache/iceberg/pull/11976/files >> >> Split out from #11948 >> >> Working on

Re: [VOTE] Document Snapshot Summary Optional Fields as Appendix in Spec

2025-01-14 Thread Péter Váry
+1 On Tue, Jan 14, 2025, 21:05 Russell Spitzer wrote: > +1 > > On Tue, Jan 14, 2025 at 2:00 PM Honah J. wrote: > >> Hi everyone, >> >> Based on good feedback on the [DISCUSS] thread >> . and >> the pull request >>

FileRewrite API refactor

2025-01-14 Thread Péter Váry
Hi Team, There is ongoing work to bring Flink Table Maintenance to Iceberg [1]. We already merged the main infrastructure and are currently working on implementing the data file rewrite [2]. During the implementation we found that part of the compaction planning implemented for Spark compaction, c

Re: [DISCUSS] Hive Support

2025-01-08 Thread Péter Váry
Hi Manu, My hope is that the Hive 4 problem is "only" a test issue. Since similar tests are running (or were running when I last saw them in the Hive codebase), there should be a working version of TestHiveMetastore which runs these tests. We might be able to incorporate similar code into ou

Re: [DISCUSS] Hive Support

2025-01-07 Thread Péter Váry
testing matrix where some tests are >>> run with both Hive 3 and Hive 4, and some tests are run with only Hive3 >>> (older Spark versions which does not support Hive 4) >>> >>> Thanks Manu for driving this! >>> Peter >>> >>> Manu Zhang

Re: [DISCUSS] Hive Support

2025-01-06 Thread Péter Váry
sion from >> the Spark runtime. > > > Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect > compatibility to be much better once Iceberg and Spark are both on Hive 4. > > Secondly, the coupling can be loosed if we are moving toward the REST > catalog. > >

Re: [DISCUSS] Hive Support

2025-01-03 Thread Péter Váry
h them. > Otherwise, we need to ask users to exclude hive libraries from Spark and > ship iceberg-spark runtime with Iceberg's hive dependencies. > > Regards, > Manu > > On Wed, Dec 18, 2024 at 9:08 PM Péter Váry > wrote: > >> @Manu: What will be the end result?

Re: [DISCUSS] Hive Support

2024-12-18 Thread Péter Váry
@Manu: What will be the end result? Do we have to use the same Hive version in Iceberg as the one defined by Spark? I think we should make sure that the Iceberg Hive version is independent of the version used by Spark. On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com wrote: > > I'm not sure there's a

Re: [DISCUSS] December board report

2024-12-11 Thread Péter Váry
Hi Ryan, Thanks for putting this together! For Java/Flink we could mention that ExpireSnapshots TableMaintenance is available now. On Thu, Dec 12, 2024, 04:47 Ajantha Bhat wrote: > At Java side, I would add > > - Core util to compute partition stats has been merged. > https://github.com/apache/i
