Re: Wide tables in V4

2025-05-29 Thread Gang Wu
>>>>>>>> such that in each horizontal partitioning(fragment), the same number of >>>>>>>> rows exist in each vertical partition, which I think is necessary to >>>>>>>> make >>>>>>>> whole/partial row co

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Gang Wu
This is a long-awaited discussion! BTW, does it make sense to take metadata json file into consideration as well? Currently it is just a large json string containing all snapshots. Since it is also on the critical path of a commit, I'm not sure if we can explore incremental semantics on it togethe

Re: Wide tables in V4

2025-05-26 Thread Gang Wu
I agree with Steven that there are limitations that Parquet cannot do. In addition to adding new columns by rewriting all files, files of wide tables may suffer from bad performance like below: - Poor compression of row groups because there are too many columns and even a small number of rows can

Re: [VOTE] Adopt the v3 spec changes

2025-05-19 Thread Gang Wu
+1 (non-binding) On Tue, May 20, 2025 at 11:35 AM Manish Malhotra < manish.malhotra.w...@gmail.com> wrote: > +1 (non-binding) > > This is Awesome! Thanks 🙏🏼 > > On Mon, May 19, 2025 at 6:09 PM Szehon Ho wrote: > >> +1 (binding) >> >> Thanks, it's an exciting step for Iceberg! >> Szehon >> >> On

Re: [VOTE] Clarify writer requirements in the spec to prevent orphan DVs

2025-05-15 Thread Gang Wu
+1 (non-binding) On Thu, May 15, 2025 at 5:40 PM Jean-Baptiste Onofré wrote: > +1 (non binding) > > Regards > JB > > On Wed, May 14, 2025 at 17:52, Anton Okolnychyi > wrote: > >> Hi all, >> >> I propose the following update to the spec to clarify that writers must >> remove any deletion vector

Re: [VOTE] Merge details about GZip metadata files to the spec.

2025-05-11 Thread Gang Wu
+1 (non-binding) On Mon, May 12, 2025 at 3:27 AM Kevin Liu wrote: > +1 (non-binding) > > Thanks for starting a vote. > > There's extra context in the PR description. As a summary, > `gz.metadata.json` is the current naming convention for GZIP compressed > metadata.json file and is implemented in

Re: [VOTE] Minor clarification for Geo Spec

2025-05-06 Thread Gang Wu
The clarification is simple and clear from the writer's perspective. CMIW, the implication is that reader should drop bbox with any NaN value regardless of the coordinate axis (in case of a writer bug). On Wed, May 7, 2025 at 6:21 AM Huang-Hsiang Cheng wrote: > +1 (non-binding) > > On May 6, 20

Re: [DISCUSS] Finalizing the v3 spec

2025-04-30 Thread Gang Wu
Thanks JB and Fokko! I agree that we are good with multi-arg transform for v3. Best, Gang On Wed, Apr 30, 2025 at 2:12 PM Xuanwo wrote: > Hi Ryan. > > Thank for starting this. > > I share the same concern as Russell regarding the recent discussion about > `metadata.json.gz`. I think it's a good

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Gang Wu
Please correct me if I'm wrong. The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calc

Re: [VOTE] Small spec change for default values

2025-04-22 Thread Gang Wu
+1 (non-binding) On Wed, Apr 23, 2025 at 4:42 AM Prashant Singh wrote: > +1 (non-binding) > > Best, > Prashant Singh > > On Tue, Apr 22, 2025 at 2:55 AM Eduard Tudenhöfner < > etudenhoef...@apache.org> wrote: > >> +1 >> >> On Tue, Apr 22, 2025 at 7:31 AM Jean-Baptiste Onofré >> wrote: >> >>> +1

Re: [VOTE] Spec Update: Variant Field Lower/Upper Bounds

2025-04-19 Thread Gang Wu
+1 (non-binding) On Sat, Apr 19, 2025 at 12:12 PM Steve Zhang wrote: > +1 (non-binding) > > Thanks, > Steve Zhang > > > > On Apr 18, 2025, at 1:29 PM, huaxin gao wrote: > > +1 (non-binding) > > >

Re: [VOTE] Update row lineage spec ID assignment

2025-04-17 Thread Gang Wu
+1 (non-binding) On Fri, Apr 18, 2025 at 7:38 AM huaxin gao wrote: > +1 (non-binding) > > On Thu, Apr 17, 2025 at 4:22 PM Denny Lee wrote: > >> +1 (non-binding) >> >> On Thu, Apr 17, 2025 at 5:14 PM Aihua Xu wrote: >> >>> + (non-binding). >>> >>> On Thu, Apr 17, 2025 at 11:22 AM Steven Wu wro

[DISCUSS] Change iceberg-cpp CI settings to only require approval for new contributors

2025-04-04 Thread Gang Wu
Hi, I am starting this thread to gather feedback for iceberg-cpp to only require approvals for new contributors. iceberg-cpp is still at its early stage and almost all contributors are new to the Iceberg community. Changing the CI setting makes it easier for people to collaborate. If there's cons

Re: Optimize Equality Deletes with Sorting

2025-04-02 Thread Gang Wu
CMIW, the spec does not enforce `identifier fields` for equality delete files. Engines are free to use different `equality_ids` among commits, though the use case should be rare. Similarly, what sort order should we use? It is common for a table to set sort order on columns other than the primary k

Re: [DISCUSS] Change iceberg-cpp CI settings to only require approval for new contributors

2025-03-31 Thread Gang Wu
l, +1 for this change. >> >> On Mon, Mar 31, 2025 at 4:50 PM Xuanwo wrote: >> >>> Here is my +1 non-binding. >>> >>> I'm watching on this repo actively and will make sure the CI won't be >>> abused. >>> >>> On Mon, Mar

Re: [VOTE] Row lineage required for v3

2025-03-31 Thread Gang Wu
+1 (non-binding) On Tue, Apr 1, 2025 at 5:15 AM Russell Spitzer wrote: > +1 > > On Mon, Mar 31, 2025 at 2:22 PM Amogh Jahagirdar <2am...@gmail.com> wrote: > >> +1 (binding) >> >> Thanks Dan! >> >> On Mon, Mar 31, 2025 at 1:20 PM Ryan Blue wrote: >> >>> +1 >>> >>> On Mon, Mar 31, 2025 at 12:01 P

Re: [VOTE] Minor simplifications for Geo Spec

2025-03-18 Thread Gang Wu
Makes sense. +1 (non-binding) On Wed, Mar 19, 2025 at 8:07 AM Jia Yu wrote: > +1 (non-binding) > > Thank you! > > On 2025/03/19 00:01:00 Szehon Ho wrote: > > Hi everyone, > > > > While working on the reference implementation for Geometry/Geography > spec, > > we noticed some parts that can be s

Re: Clarification on sorting floating-point numbers

2025-02-27 Thread Gang Wu
FYI: there was an effort from Jan (cc'd) to introduce a total order for floating-point numbers on the Parquet side: [1][2]. [1] https://github.com/apache/parquet-format/pull/221 [2] https://github.com/apache/parquet-format/pull/196 On Thu, Feb 27, 2025 at 4:24 AM Devin Smith wrote: > The spec h

Re: [VOTE] Allow Row-Lineage with Equality Deletes

2025-02-19 Thread Gang Wu
+1 (non-binding) On Thu, Feb 20, 2025 at 7:12 AM Steven Wu wrote: > +1 > > On Wed, Feb 19, 2025 at 2:15 PM Russell Spitzer > wrote: > >> The PR: https://github.com/apache/iceberg/pull/12230 is basically ready >> now. So let's do a last vote to make sure everyone is aware of the upcoming >> cha

Re: [DISCUSS] Introduce C FFI for iceberg rust

2025-02-17 Thread Gang Wu
Thanks Xuanwo! Looking forward to the possibility of iceberg-cpp integration with the C FFI! Best, Gang On Tue, Feb 18, 2025 at 3:21 PM Renjie Liu wrote: > Hi: > > Thanks Xuanwo for raising this. > > As xuanwo mentioned, rust implementation + c binding will provide a good > foundation for cros

Re: [VOTE] Add RemoveSchemas update type to REST spec

2025-02-11 Thread Gang Wu
+1 (non-binding) On Wed, Feb 12, 2025 at 6:17 AM Amogh Jahagirdar <2am...@gmail.com> wrote: > +1 thanks for driving this Gabor! > > On Wed, Feb 12, 2025 at 2:35 AM rdb...@gmail.com wrote: > >> +1 >> >> On Tue, Feb 11, 2025 at 10:50 AM Steve Zhang >> wrote: >> >>> +1 nb >>> >>> Thanks, >>> Steve

Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

2025-02-11 Thread Gang Wu
eletes. Instead, this spec change now says > the updated row is a complete new row with new row_id. > > On Tue, Feb 11, 2025 at 7:39 PM Gang Wu wrote: > >> Hi Russell, >> >> Thanks for supporting equality deletes to row lineage! >> >> > accept that "

Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

2025-02-11 Thread Gang Wu
Hi Russell, Thanks for supporting equality deletes to row lineage! > accept that "updates" will be treated as "delete" and "insert" I would say that it has obvious drawbacks below (though it is better than not supported): 1) updates will be populated differently when outputting changelogs to use

Re: [VOTE] Simplify multi-arg table metadata

2025-02-09 Thread Gang Wu
+1 (non-binding) (she says hi to your cat!) Best, Gang On Sun, Feb 9, 2025 at 5:02 PM Fokko Driesprong wrote: > (Second attempt, the cat ran over the keyboard) > > Hey everyone, > > After the positiv

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-06 Thread Gang Wu
+1 (non-binding) On Fri, Feb 7, 2025 at 8:20 AM Daniel Weeks wrote: > +1 > > On Thu, Feb 6, 2025, 4:02 PM Russell Spitzer > wrote: > >> +1 >> >> On Fri, Feb 7, 2025 at 12:57 AM Denny Lee wrote: >> >>> +1 (non-binding) - super exciting! >>> >>> On Thu, Feb 6, 2025 at 3:52 PM rdb...@gmail.com >

Re: Welcome Huaxin Gao as a committer!

2025-02-06 Thread Gang Wu
Congrats Huaxin! Best, Gang On Thu, Feb 6, 2025 at 5:10 PM Szehon Ho wrote: > Hi everyone, > > The Project Management Committee (PMC) for Apache Iceberg has > invited Huaxin Gao to become a committer, and I am happy to announce that > she has accepted. Huaxin has done a lot of impressive work

Re: [discuss] Standardizing Naming Schemes for Language-Specific Configurations

2025-01-23 Thread Gang Wu
Generally it makes sense to define separate language-specific configurations. I think we need to think about the following items: 1. Is it python-specific to add the prefix? Should Rust/Go be -rs/-go as the convention? 2. Which part of the spec is the best place to describe this? It seems that we

Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-15 Thread Gang Wu
+1 (non-binding) On Thu, Jan 16, 2025 at 2:30 PM Péter Váry wrote: > +1 > > Steven Wu wrote (on Thu, Jan 16, 2025, > 0:46): > >> +1 >> >> On Wed, Jan 15, 2025 at 9:00 AM Russell Spitzer < >> russell.spit...@gmail.com> wrote: >> >>> Hi Everyone! >>> >>> PR: https://github.com/apache/ic

Re: [DISCUSS] Relocate Parquet to Iceberg Core

2024-12-18 Thread Gang Wu
IIUC, iceberg-parquet depends on iceberg-arrow for the vectorized reader implementation (though partially supported). Should we relocate iceberg-arrow together? Since I have mentioned that the vectorized reader implementation is partially supported, is it a direction that needs to be improved? There i

Re: [DISCUSS] December board report

2024-12-11 Thread Gang Wu
For C++, I think it is aimed for a full featured C++ library (not for puffin implementation only). On Thu, Dec 12, 2024 at 6:14 AM rdb...@gmail.com wrote: > I'll update it. Thanks! > > (By the way, the Avro default value support was in the Java section) > > On Wed, Dec 11, 2024 at 2:00 PM Matt T

Re: New committer: Matt Topol

2024-12-10 Thread Gang Wu
Congrats Matt! On Tue, Dec 10, 2024 at 8:57 PM Sung Yun wrote: > Congratulations Matt! > > On 2024/12/10 12:49:25 Alex Dutra wrote: > > Congratulations, Matt! Go!! > > > > On Tue, Dec 10, 2024 at 1:08 PM Péter Váry > > wrote: > > > > > Congratulations Matt! > > > > > > On Tue, Dec 10, 2024, 12:

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gang Wu
umplido wrote: > >> This sounds awesome. I am looking forward to the slack channel being >> available so I can also help! >> >> On Fri, Nov 22, 2024 at 10:03, Gang Wu () wrote: >> > >> > Thanks for the support, Fokko and JB! >> > >> >

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gang Wu
from > the Impala community we could add some additional auxiliary functionality > for the V3 positional deletes later on. > > 2) I learned that a part of the community is interested in having a C++ > implementation of the Iceberg lib in general for their C++ engine. cc @Gang >

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-11-22 Thread Gang Wu
d that a part of the community is interested in having a C++ > implementation of the Iceberg lib in general for their C++ engine. cc @Gang > Wu > > There seemed to be general support from the community to start up such a > sub-project, so I'm reaching out now to ask for some gu

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-30 Thread Gang Wu
Thanks Russell for bringing this up! +1 on deprecating equality deletes. IMHO, this is something that should reside only in the ingestion engine. Best, Gang On Thu, Oct 31, 2024 at 5:07 AM Russell Spitzer wrote: > Background: > > 1) Position Deletes > > > Writers determine what rows are delet

Re: [VOTE] Deletion Vectors in V3

2024-10-29 Thread Gang Wu
+1 (non-binding) Best, Gang On Wed, Oct 30, 2024 at 5:46 AM Anton Okolnychyi wrote: > Hi folks, > > We have been discussing the new layout for position deletes in V3 for a > while now. It seems the community reached consensus. I'd like to start a > vote on adding deletion vectors to the V3 spec

Re: [DISCUSS] Additional language implementations for Iceberg Puffin reader/writer

2024-08-29 Thread Gang Wu
Hi, It won't be an issue if there is already an iceberg-cpp implementation. However, it is unfortunate to see duplicate efforts from different query engines to implement their own C++ Iceberg reader and writers. Is it a good chance to add official C++ implementation by providing a puffin reader/wr

Re: [DISCUSS] Variant Spec Location

2024-08-23 Thread Gang Wu
>>> >>> This could be developed separately and then be represented in Arrow >>> using an extension type (perhaps a canonical one as in >>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html). >>> >>> What do other Arrow developers

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Gang Wu
gt;> >> Hi Gang, >> >> Sorry, but can you give a pointer to the start of this discussion thread >> in a readable format (for example a mailing-list archive)? It appears >> that dev@arrow wasn't cc'ed from the start and that can make it >> difficult to und

Re: [DISCUSS] Variant Spec Location

2024-08-21 Thread Gang Wu
usion > > extension that operates on this [1], and already have some ideas on how > > such an extension type might be defined. I'm not yet caught up on the > > shredded specification, but I think having just the binary format would > be > > beneficial for in-memory an

Re: Type promotion in v3

2024-08-19 Thread Gang Wu
Hi Micah, If we go with the approach that type promotion results in a change in the field-id, what happens when a certain field has been changed multiple times? Does it mean that we end up with tracking the lineage of field change history? Thanks, Gang On Tue, Aug 20, 2024 at 7:34 AM Micah Kornf

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
. > > It is worth noting that we also need to standardize many functions > related to it. > > A neutral place to maintain it is a great choice. > > - As Gang Wu said, a standalone project is good, just like RoaringBitmap > [1]. > - As Ryan said, Parquet community is a ne

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Gang Wu
different and I don't think this should >> block forking the spec, but we should make sure that the decision is >> publicly documented within both communities. >> >> Thanks, >> Micah >> >> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer < >> russel

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Gang Wu
Sorry for chiming in late. From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl. IMHO, copying the variant type spec into Apache Iceberg has so

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-06 Thread Gang Wu
Just my two cents. Not all tables have a partition definition, and table-level stats would benefit these tables. In addition, NDV might not be easily populated from partition-level statistics. Thanks, Gang On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE wrote: > Thanks for raising the discussion Hu

Re: [ANNOUNCE] Welcoming new committers and PMC members

2024-07-23 Thread Gang Wu
Congrats! On Tue, Jul 23, 2024 at 10:17 PM Russell Spitzer wrote: > "so many" :) > > On Tue, Jul 23, 2024 at 9:14 AM Russell Spitzer > wrote: > >> This is truly an exciting day. To have to many qualified folks being >> recognized by the Iceberg project fills me with pride. I can't wait to see >

Re: [Discuss] Geospatial Support

2024-06-05 Thread Gang Wu
> The min/max stats are discussed in the doc (Phase 2), depending on the non-trivial encoding. Just want to add that min/max stats filtering could be supported by file format natively. Adding geometry type to parquet spec is under discussion: https://github.com/apache/parquet-format/pull/240 Best

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-05-14 Thread Gang Wu
> We may need some guidance on just how many we need to look at; > we were planning on Spark and Trino, but weren't sure how much > further down the rabbit hole we needed to go. There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes t

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-05-10 Thread Gang Wu
Hi, This sounds very interesting! IIUC, the current variant type in the Apache Spark stores data in the BINARY type. When it comes to subcolumnarization, does it require the file format (e.g. Apache Parquet/ORC/Avro) to support variant type natively? Best, Gang On Sat, May 11, 2024 at 1:07 PM T