Re: FileRewrite API refactor

2025-02-04 Thread Steven Wu
At a high level, it makes sense to separate out the planning and execution to promote reusing the planning code across engines. Just to add a 4th class to Russell's list 1) RewriteGroup: A Container that holds all the files that are meant to be compacted along with information about them 2) Rewriter:

Re: [VOTE] Update partition stats spec for V3

2025-02-02 Thread Steven Wu
+1 The spec change makes sense. I left a question in the PR. On Sun, Feb 2, 2025 at 8:52 PM roryqi wrote: > +1 > > On Sun, Feb 2, 2025 at 10:16, Amogh Jahagirdar <2am...@gmail.com> wrote: > >> +1 >> >> On Sat, Feb 1, 2025 at 11:05 AM huaxin gao >> wrote: >> >>> +1 (non-binding) >>> >>> On Sat, Feb 1, 2025 at 8

Re: [DISCUSS/VOTE] Add in ChangeLog Reserved Field IDs to Spec and Decrement Row Lineage Reserved IDs

2025-01-26 Thread Steven Wu
+1 On Sun, Jan 26, 2025 at 3:01 PM John Zhuge wrote: > +1 (non-binding) > > John Zhuge > > > On Sun, Jan 26, 2025 at 2:59 PM Aihua Xu wrote: > >> +1 (non-binding). >> >> Thanks for fixing it. >> >> On Sun, Jan 26, 2025 at 11:30 AM Anton Okolnychyi >> wrote: >> >>> +1 good catch >>> >>> Sun, 26

Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-15 Thread Steven Wu
+1 On Wed, Jan 15, 2025 at 9:00 AM Russell Spitzer wrote: > Hi Everyone! > > PR: https://github.com/apache/iceberg/pull/11976/files > > Split out from #11948 > > Working on the row-lineage implementation made it clear that we needed a > way to get i

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Steven Wu
job should fail in this case. On Fri, Nov 1, 2024 at 10:57 AM Steven Wu wrote: > Shani, > > That is a good point. It is certainly a limitation for the Flink job to > track the inverted index internally (which is what I had in mind). It can't > be shared/synchronized with other

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Steven Wu
50 AM Shani Elharrar wrote: > Even if Flink can create this state, it would have to be maintained > against the Iceberg table, we wouldn't like duplicates (keys) if other > systems / users update the table (e.g manual insert / updates using DML). > > Shani. > > On 1 Nov 2

Re: [VOTE] Release Apache Iceberg 1.7.0 RC0

2024-11-01 Thread Steven Wu
+1 (binding) Verified signature, checksum, license. Did Flink SQL local testing with the runtime jar. Didn't run build because Azure FileIO testing requires Docker environment. On Fri, Nov 1, 2024 at 5:02 AM Fokko Driesprong wrote: > Thanks Russell for running this release! > > +1 (binding) > >

Re: [DISCUSS] - Deprecate Equality Deletes

2024-11-01 Thread Steven Wu
>>> row by UUID. With position deletes each delete is expensive without an >>> index on that UUID. >>> With equality deletes each delete is cheap and while reads/compaction is >>> expensive but when updates are frequent and reads are sporadic that's a >>>

Re: [DISCUSS] - Deprecate Equality Deletes

2024-10-31 Thread Steven Wu
We probably all agree with the downside of equality deletes: it postpones all the work on the read path. In theory, we can implement position deletes only in the Flink streaming writer. It would require the tracking of last committed data files per key, which can be stored in Flink state (checkpoi
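The per-key tracking described above can be sketched as a small index that remembers where the latest committed row for each key lives. This is only an illustration under stated assumptions: `KeyLocationIndex` and `RowLocation` are hypothetical names (not Iceberg or Flink classes), keys are plain strings, and an in-memory `HashMap` stands in for checkpointed Flink state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: KeyLocationIndex and RowLocation are illustrative names,
// not Iceberg or Flink classes.
final class KeyLocationIndex {

    // Location of the latest committed row for a key.
    static final class RowLocation {
        final String dataFile;
        final long position;

        RowLocation(String dataFile, long position) {
            this.dataFile = dataFile;
            this.position = position;
        }
    }

    // Assumption: a real job would keep this map in checkpointed state;
    // a HashMap stands in for that here.
    private final Map<String, RowLocation> lastCommitted = new HashMap<>();

    // Records the new row for the key and returns the row it supersedes, if any.
    // The returned location is what a position delete would point at.
    Optional<RowLocation> upsert(String key, String dataFile, long position) {
        RowLocation previous = lastCommitted.put(key, new RowLocation(dataFile, position));
        return Optional.ofNullable(previous);
    }
}
```

With such an index, the first write of a key yields no delete, while a later write of the same key returns the earlier file and position, so the writer can emit a position delete instead of an equality delete.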

Re: [VOTE] Deletion Vectors in V3

2024-10-30 Thread Steven Wu
+1 On Wed, Oct 30, 2024 at 1:07 AM xianjin wrote: > +1 (non binding) > > On Wed, Oct 30, 2024 at 2:28 PM Jean-Baptiste Onofré > wrote: > >> +1 (non binding) >> >> Regards >> JB >> >> On Tue, Oct 29, 2024 at 10:45 PM Anton Okolnychyi >> wrote: >> > >> > Hi folks, >> > >> > We have been discussi

Re: [Discuss] Different file formats for ingestion and compaction

2024-10-25 Thread Steven Wu
agree with Ryan. Engines usually provide override capability that allows users to choose a different write format (than table default) if needed. There are many production use cases that write columnar formats (like Parquet) in streaming ingestion. I don't necessarily agree that it will be common

Re: [DISCUSS] Remove iceberg-pig module ?

2024-10-17 Thread Steven Wu
+1 On Thu, Oct 17, 2024 at 10:44 AM John Zhuge wrote: > +1 (non-binding) > > On Thu, Oct 17, 2024 at 10:21 AM Yufei Gu wrote: > >> +1 for deprecating it in 1.7 >> Yufei >> >> >> On Thu, Oct 17, 2024 at 9:51 AM Ajantha Bhat >> wrote: >> >>> +1 for dropping it. >>> >>> On Thu, Oct 17, 2024 at 8:

Re: [VOTE] Table V3 Spec: Row Lineage

2024-10-10 Thread Steven Wu
+1 On Thu, Oct 10, 2024 at 2:52 PM Yufei Gu wrote: > +1 > Yufei > > > On Thu, Oct 10, 2024 at 3:47 PM Amogh Jahagirdar <2am...@gmail.com> wrote: > >> +1, I've been reviewing this proposal/spec change for a bit and I think >> it's in a good state for the community to work on an implementation. >>

Re: Iceberg View Spec Improvements

2024-10-08 Thread Steven Wu
t case (in the doc they are more abstract/generic >> than the version you shared). Would be great to provide your feedback on >> the assumptions in the doc. >> >> Thanks, >> Walaa. >> >> >> On Tue, Oct 8, 2024 at 9:40 AM Steven Wu wrote: >> >>

Re: Iceberg View Spec Improvements

2024-10-08 Thread Steven Wu
I'd like to follow up on Russell's suggestion of using a federated catalog for resolving the catalog name/alias problem. I think Russell's idea is that the federated catalog standardizes the catalog names (for referencing). That could solve the problem. There are two cases: (1) single catalog: there i

Re: [DISCUSS] Iceberg Summit 2025 ?

2024-10-02 Thread Steven Wu
Regarding content, we can have multiple tracks. - technology deep dive: how things work internally especially with new features and innovations - ecosystem: interesting learnings and development from ecosystem integrations - use cases: success story, learnings, limitations from different industries

Re: [DISCUSS] Iceberg Materialzied Views

2024-10-01 Thread Steven Wu
ue I'd like to bring up about using UUIDs which is >> that these UUIDs are client generated and there's no validation that they >> are indeed globally unique identifiers. The catalog just persists whatever >> it is given without validating that the UUIDs are indeed

Re: [DISCUSS] Iceberg Summit 2025 ?

2024-09-28 Thread Steven Wu
+1 for hybrid with in-person elements. On Sat, Sep 28, 2024 at 4:23 PM Matt Topol wrote: > +1 from me as well, I would love to attend an in person/hybrid iceberg > summit. Workshops seem like a perfect way to help the community. > > On Sat, Sep 28, 2024, 7:11 PM Honah J. wrote: > >> +1 on hosti

Re: [DISCUSS] Modify ThreadPools.newWorkerPool to avoid unnecessary Shutdown Hook registration

2024-09-27 Thread Steven Wu
24 at 1:52 AM Jean-Baptiste Onofré > wrote: > >> Hi Steven, >> >> I agree with you here. I think we can use semantics similar to >> ThreadPoolExecutor/ScheduledThreadPoolExecutor (like >> newFixedThreadPool, newWorkStealingPool, ...). >> >> Regards >

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-25 Thread Steven Wu
"sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM >>>>>>>> events\nGROUP BY 2", >>>>>>>> >>>>>>>> "dialect": "spark", >>>>>>>> >>>>>>>>

Re: [DISCUSS] Modify ThreadPools.newWorkerPool to avoid unnecessary Shutdown Hook registration

2024-09-25 Thread Steven Wu
First, we should definitely add Javadoc to `ThreadPools.newWorkerPool` on its behavior with a shutdown hook. It is not obvious from the method name. I would actually go further and deprecate `newWorkerPool` in favor of `newExitingWorkerPool`. The `newWorkerPool` method name invites misuse, as the
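To make the naming concern concrete, here is a minimal sketch contrasting a plain pool with one that registers a JVM shutdown hook. It uses only `java.util.concurrent`; `WorkerPools`, `newFixedPool`, and `newExitingPool` are hypothetical names chosen for illustration, and this is not the Iceberg `ThreadPools` implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper names; this is not the Iceberg ThreadPools implementation.
final class WorkerPools {
    private WorkerPools() {}

    // Plain pool: no hook is registered, the caller owns the lifecycle
    // and must call shutdown() itself.
    static ExecutorService newFixedPool(int size) {
        return Executors.newFixedThreadPool(size);
    }

    // "Exiting" pool: daemon threads plus a JVM shutdown hook that drains the pool at exit.
    // The hook keeps a reference to the pool until the JVM exits, so creating many
    // short-lived pools this way accumulates hooks that are never removed.
    static ExecutorService newExitingPool(int size) {
        ExecutorService pool = Executors.newFixedThreadPool(size, runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true); // daemon threads do not block JVM exit
            return thread;
        });
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            pool.shutdown();
            try {
                // Assumption: a 5 second grace period, chosen only for illustration.
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        return pool;
    }
}
```

A name such as `newExitingPool` tells the caller that a hook is registered, which is the point of the rename suggested above.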

Re: [VOTE] Drop Python3.8 Support in PyIceberg 0.8.0

2024-09-23 Thread Steven Wu
+1 (binding). makes sense. On Mon, Sep 23, 2024 at 9:38 AM Yufei Gu wrote: > +1 Thanks for bringing this up. > > Yufei > > > On Mon, Sep 23, 2024 at 9:27 AM Kevin Liu wrote: > >> +1 non-binding. Thanks for starting this conversation! >> >> >> On Fri, Sep 20, 2024 at 2:02 PM Sung Yun wrote: >>

Re: Code structuring question

2024-09-19 Thread Steven Wu
I'll share my take on this. My first choice would be leveraging the Java access modifiers, which enforce the visibility by the programming language. Users won't see non-public classes at all. That is best for the users. Peter mentioned the potential downside of collocating 50 classes under one pac

Re: [Discuss] test logging is broken and Avro 1.12.0 upgraded slf4j-api dep to 2.x

2024-09-16 Thread Steven Wu
ase, I think the best path forward is to upgrade to 2.x but not >> use the new API features that will cause problems if downstream libraries >> are not already on 2.x. >> >> Does that sound reasonable? >> >> On Wed, Sep 11, 2024 at 11:17 AM Steven Wu wrote:

Re: [Discuss] test logging is broken and Avro 1.12.0 upgraded slf4j-api dep to 2.x

2024-09-11 Thread Steven Wu
lf4j-api has never been broken."* On Mon, Sep 9, 2024 at 9:22 AM Steven Wu wrote: > Bump the thread to bring the awareness of the issue and implication of > slf4j 2.x upgrade. > > On Mon, Aug 26, 2024 at 12:24 PM Steve Zhang > wrote: > >> I believe dependabot tried

Re: [DISCUSS] September board report

2024-09-10 Thread Steven Wu
> Flink Range distribution for Sinks It is already included in Ryan's draft > Flink Source V2 improvements and V1 deprecation to prepare for Flink 2.0 This is still ongoing. There is a blocking issue with FileIOParser on HadoopFileIO: https://github.com/apache/iceberg/pull/10926 On Tue, Sep 10

Re: [Discuss] test logging is broken and Avro 1.12.0 upgraded slf4j-api dep to 2.x

2024-09-09 Thread Steven Wu
> > [1]https://github.com/apache/iceberg/pull/9688 > > Thanks, > Steve Zhang > > > > On Aug 24, 2024, at 7:37 PM, Steven Wu wrote: > > Hi, > > It seems that test logging is broken in many modules (like core, flink) > because slf4j-api was upgraded to 2.x while sl

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-03 Thread Steven Wu
Walaa, for the listed discussion points, how should we move forward? should we have another MV sync meeting? BTW, Jan's latest spec PR addressed my comment on UUID. On Mon, Sep 2, 2024 at 4:35 PM Walaa Eldin Moustafa wrote: > Hi Jan, > > I think we need further discussion for a few reasons: > >

Re: [ANNOUNCE] Apache Iceberg release 1.6.1

2024-08-28 Thread Steven Wu
Thanks Carl for driving this release! On Wed, Aug 28, 2024 at 8:34 AM Carl Steinbach wrote: > I'm pleased to announce the release of Apache Iceberg 1.6.1! > > Apache Iceberg is an open table format for huge analytic datasets. Iceberg > delivers high query performance for tables with tens of peta

Re: [VOTE] Merge guidelines for committing PRs

2024-08-28 Thread Steven Wu
+1 (binding) On Wed, Aug 28, 2024 at 9:29 AM Micah Kornfield wrote: > I propose to merge https://github.com/apache/iceberg/pull/10780 as a > starting place for describing community norms around merging/discussing PRs. > > We've discussed this [1] and gone through a bunch of revisions on the PR >

Re: [DISCUSS] Improving Position Deletes in V3

2024-08-26 Thread Steven Wu
Anton, Thanks a lot for the improvement proposal and great write-up with quantitative supporting arguments. +1 from my side. Thanks, Steven On Wed, Aug 21, 2024 at 2:29 PM Anton Okolnychyi wrote: > Hey folks, > > As discussed during the sync, I've been working on a proposal to improve > the

[Discuss] test logging is broken and Avro 1.12.0 upgraded slf4j-api dep to 2.x

2024-08-24 Thread Steven Wu
Hi, It seems that test logging is broken in many modules (like core, flink) because slf4j-api was upgraded to 2.x while slf4j-simple provider is still on 1.7. I created a PR that upgraded slf4j-simple testImplementation to 2.x for all subprojects. https://github.com/apache/iceberg/pull/11001 Tha

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steven Wu
og table, i.e., .changes) to get a DataFrame which is > saved to a temporary Spark view which can then be queried; if net_changes > is true, only the net changes are produced for this temporary view. This > functionality uses ChangelogIterator.removeNetCarryovers (which is in > Spark)

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steven Wu
1c,INSERTED > > I think (a) is the correct behaviour. > > Thanks, > Peter > > Steven Wu wrote (on Wed, Aug 21, 2024, > 22:27): > >> Agree with everyone that option (a) is the correct behavior. >> >> On Wed, Aug 21, 2024 at 11:57 AM Steve Zha

Re: clarification on changelog behavior for equality deletes

2024-08-21 Thread Steven Wu
Agree with everyone that option (a) is the correct behavior. On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang wrote: > I agree that option (a) is what user expects for row level changes. > > I feel the added deletes in given snapshots provides a PK of DELETED > entry, existing deletes are used to re

Re: [VOTE] Release Apache Iceberg 1.6.1 RC1

2024-08-20 Thread Steven Wu
> TestHadoopCommits > testConcurrentFastAppends(File) FAILED I have seen this test be a bit flaky before. On Tue, Aug 20, 2024 at 6:31 AM Jean-Baptiste Onofré wrote: > +1 (non binding) > > I checked: > - download links are OK (both on dist and Maven Staging repo) > - build passed on the tag using J

Re: [VOTE] Spec changes in preparation for v3

2024-08-19 Thread Steven Wu
+1 (binding) On Mon, Aug 19, 2024 at 4:06 PM Anton Okolnychyi wrote: > +1 (binding) > > - Anton > > On Mon, Aug 19, 2024 at 13:49 John Zhuge wrote: > >> +1 (non-binding) >> >> On Mon, Aug 19, 2024 at 1:34 PM Yufei Gu wrote: >> >>> +1 >>> Yufei >>> >>> >>> On Mon, Aug 19, 2024 at 1:17 PM Fokko Dr

Re: Welcome Péter, Amogh and Eduard to the Apache Iceberg PMC

2024-08-13 Thread Steven Wu
Congratulations! On Tue, Aug 13, 2024 at 7:18 PM Kevin Liu wrote: > Congratulations all! > > On Wed, Aug 14, 2024 at 9:38 AM Jun H. wrote: > >> Congratulations! >> >> >> On Tue, Aug 13, 2024 at 4:28 PM Rodrigo Meneses >> wrote: >> >>> This is amazing! Congratulations!!! >>> >>> On Tue, Aug 13,

Re: [DISCUSS] Flink 1.20: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated

2024-08-12 Thread Steven Wu
ake it explicit in the >> changelog, and if possible give some hints on how to drain the Flink jobs. >> >> Kind regards, >> Fokko >> >> On Mon, Aug 12, 2024 at 04:57 Steven Wu wrote: >> >>> >>> *What* >>> >>> In the next Ice

[DISCUSS] Flink 1.20: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated

2024-08-11 Thread Steven Wu
*What* In the next Iceberg 1.7 release with Flink 1.20 support [1], I am proposing to make the following changes for *Flink* *1.20 only* . 1. Mark the old `FlinkSource` as deprecated and redirect users to the FLIP-27 `IcebergSource` in the Javadoc. 2. Make the FLIP-27 source the default for Flin

Re: [DISCUSS] Implementing a table-level statistics file to store column statistics

2024-08-07 Thread Steven Wu
I also like the middle ground of partition level stats, which is also easier to perform incremental refresh (at partition level). if the roll-up of partition level stats turned out to be slow, I don't mind adding table level stats aggregated from partition level stats. Having partition level stats

Re: Flink Table Maintenance - Tag based locking

2024-08-04 Thread Steven Wu
I also don't feel it is the best fit to use tags to implement locks for passing control messages. This is the main sticking point for me from the design doc. However, we haven't been able to come up with a better solution yet. Maybe we need to go back to the drawing board again. I am also not sure

Re: [VOTE] Clarify "File System Tables" in the table spec

2024-08-01 Thread Steven Wu
+1 (binding) On Thu, Aug 1, 2024 at 10:16 AM Ryan Blue wrote: > Adding my own +1 > > On Thu, Aug 1, 2024 at 9:52 AM Robert Stupp wrote: > >> +1 (nb) >> On 01.08.24 18:17, Yufei Gu wrote: >> >> +1 (binding) >> Yufei >> >> >> On Thu, Aug 1, 2024 at 8:33 AM Daniel Weeks wrote: >> >>> Added commen

Re: [DISCUSS] Describing REST Server capabilities

2024-07-30 Thread Steven Wu
> (2) version the entire catalog spec. A released catalog spec version will contain a list of configs it supports, and also a set of APIs and all features embedded in the APIs. A server will report the specific catalog version it adheres to, and then document the nuances. Jack, just to clarify, a

Re: [VOTE] Drop Java 8 support in Iceberg 1.7.0

2024-07-26 Thread Steven Wu
+1 (binding) I would also suggest keeping the vote open for 7 days for a larger decision like this. On Fri, Jul 26, 2024 at 8:50 AM Ryan Blue wrote: > +1 > > On Fri, Jul 26, 2024 at 8:42 AM Russell Spitzer > wrote: > >> +1 (bind) >> >> On Fri, Jul 26, 2024 at 8:34 AM Péter Váry >> wrote: >>

Re: [ANNOUNCE] Apache Iceberg release 1.6.0

2024-07-25 Thread Steven Wu
JB, thank you for driving the release. Awesome job! And many thanks to all the contributors! On Thu, Jul 25, 2024 at 8:24 AM Jack Ye wrote: > Thanks JB for organizing the release, and thanks everyone that has > contributed! > > -Jack > > On Thu, Jul 25, 2024 at 8:07 AM Jean-Baptiste Onofré > w

Re: Java String to Expression Util?

2024-07-25 Thread Steven Wu
seems like you need a SQL expression parser. I guess in the Java world we don't have a need yet. Engines (Spark, Flink etc.) have their own SQL parsers. They may leverage Calcite for that. On Thu, Jul 25, 2024 at 2:16 PM Pucheng Yang wrote: > The motivation is that users usually need to define w

Re: [ANNOUNCE] Welcoming new committers and PMC members

2024-07-24 Thread Steven Wu
Congrats to everyone. On Wed, Jul 24, 2024 at 9:40 AM karuppayya wrote: > Congratulations everyone! > > On Wed, Jul 24, 2024 at 8:27 AM Péter Váry > wrote: > >> Congratulations all! >> >> Bryan Keller wrote (on Wed, Jul 24, 2024, >> 16:21): >> >>> Congrats all! >>> >>> On Jul 24, 20

Re: Dropping JDK 8 support

2024-07-22 Thread Steven Wu
+1 (binding) On Mon, Jul 22, 2024 at 6:37 AM Piotr Findeisen wrote: > Hi, > > in the "Building with JDK 21" email thread we discussed adding JDK 21 > support and also dropping JDK 8 support, as these things were initially > related. > A lot of people expressed acceptance for dropping JDK 8 suppo

Re: [VOTE] Merge table spec clarifications on time travel and equality deletes

2024-07-18 Thread Steven Wu
I am +1 for the spec clarifications. I have left some comments for the time travel PR. We can discuss the details in the PR itself before merging. In particular, I am wondering if the time travel clarification can be added to the existing `snapshots` section of the spec (instead of adding a new `imp

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Steven Wu
Thanks Jack for the thoughtful comments. I am not fully sold that object storage issues have been solved. S3 directory bucket is not a general purpose bucket and lives in a single zone. The data durability guarantee may not work for many use cases. We don't know when S3 will add the atomic renamin

Re: [DISCUSS] Flink unaligned checkpoints

2024-07-18 Thread Steven Wu
Regarding unaligned checkpoints, Flink savepoint is always aligned and recommended for Flink version upgrade. We can potentially recommend users to use Flink savepoint to pick up this fix. I will take a closer look at the PR. On Thu, Jul 18, 2024 at 6:22 AM Péter Váry wrote: > Hi Team, > > Qish

Re: Re: Re: Core:support redis and http lock-manager

2024-07-16 Thread Steven Wu
Lisoda, HadoopCatalog has many issues for production usage like Dan said. It has never been recommended in production. It was widely used in unit test code, which is also slowly moving toward InMemoryCatalog. As the community is aligned behind the REST catalog, it is preferable to limit the work re

Re: [VOTE] spec: remove the JSON spec for content file and file scan task sections

2024-07-15 Thread Steven Wu
, 2024 at 3:37 AM Piotr Findeisen < >>>>>>>> piotr.findei...@gmail.com> wrote: >>>>>>>> >>>>>>>>> it looks it's part of the spec that's not connected to the other >>>>>>>>> parts of t

[VOTE] spec: remove the JSON spec for content file and file scan task sections

2024-07-10 Thread Steven Wu
Following the latest community guidelines, I would like to start a voting thread on removing the JSON spec for content file and file scan task. Here is the PR for the spec change [1] This was previously discussed in the dev mailing list [2]. While it is good to add the JSON serializer in iceberg-c

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Steven Wu
I am not totally convinced of the motivation yet. I thought the snapshot retention window is primarily meant for time travel and troubleshooting table changes that happened recently (like a few days or weeks). Is it valuable enough to keep expired snapshots for as long as months or years? While m

Re: Building with JDK 21

2024-07-09 Thread Steven Wu
+1 for dropping Java 8 as it is really old (3 versions behind latest Java LTS version - 11, 17, 21). Here was the thread from over a year ago on dropping JDK 8 support. https://lists.apache.org/thread/gq08prw7tv8q6h71lfc9bjlj074ckccv Hive's lack of Java 11 support was identified as a blocker. The

Re: Proposal to support cherrypick static overwrite

2024-05-30 Thread Steven Wu
Pucheng, I am not sure about others. At least I had a hard time understanding what the problem/proposal is. What is "cherrypick static partition overwrite"? Thanks, Steven On Thu, May 30, 2024 at 11:59 AM Pucheng Yang wrote: > Hi community, > > I would like to follow up on this proposal and

Re: Addressing security questions in the Iceberg REST specification

2024-05-29 Thread Steven Wu
ion mechanisms can be plugged in to talk to the REST catalog >> >> 3. refactor and make OAuth2 an implementation of that interface. I can >> also help with doing the same for AWS Sigv4, and the community can further >> support some additional ones like Kerberos, SAML, Google SS

Re: Addressing security questions in the Iceberg REST specification

2024-05-29 Thread Steven Wu
Wondering if the auth endpoints can be separated out to a separate OpenAPI spec file. Then we still have some reference for interactions with auth server and make it clear it is not required as part of the REST catalog server. In most enterprise environments, auth server is likely a separate server

Re: Iceberg Summit Video not accessible from non-registered users

2024-05-23 Thread Steven Wu
I heard the recordings will be uploaded to Youtube in a few weeks for public access. On Thu, May 23, 2024 at 5:46 PM Pucheng Yang wrote: > Hi, > > My co-workers who did not register for the iceberg summit would like to > check the videos. However, it seems the registration is closed hence they >

Re: [DISCUSS] camel-iceberg component

2024-05-22 Thread Steven Wu
seems reasonable to keep camel-iceberg inside camel project, which already has many integration components. +1 for that. On Wed, May 22, 2024 at 8:58 AM Ajantha Bhat wrote: > +1, > > It is always good to have new ways to ingest data as an Iceberg table. > > - Ajantha > > On Wed, May 22, 2024 at

Re: Can I get started with a plain java app?

2024-05-16 Thread Steven Wu
Iceberg has a Java library, which is the most complete implementation of the spec (compared to other languages like Python, Rust) at the moment. You can certainly use the Java library directly to write and commit data to Iceberg. But you will likely need to implement quite a bit of code for things

Re: Materialized Views: Next Steps

2024-05-07 Thread Steven Wu
Walaa, thanks for initiating the next step. With the agreed model of separate view and storage table, I am wondering if a separate materialized view spec page is needed. E.g., the new view metadata (view-materialized and view-storage-table) is probably good to be added to the view page directly to

Re: [Proposal] Add support for Flink Maintenance in Iceberg

2024-05-04 Thread Steven Wu
+1 for the proposal of adding more table maintenance to Flink. It is great that the maintenance actions can be run in two modes: (1) embedded in the Flink writer job as post commit stage (2) standalone Flink batch job/action. I probably wouldn't label the two goals as primary and secondary. Differ

Re: FlinkFileIO implementation

2024-04-25 Thread Steven Wu
agree with Dan that ResolvingFileIO solves a different problem to resolve FileIO based on storage schema (like s3) and is probably not a good fit for what Peter is trying to do. I also mentioned some downsides of FlinkFileSystemFileIO in the PR. It doesn't support batch deletes or progressive uplo

Re: [DISCUSS] spec: remove the file scan task JSON serialization section from table spec

2024-04-22 Thread Steven Wu
k > > On Wed, Feb 21, 2024 at 3:47 PM Steven Wu wrote: > >> here is the PR for spec update: >> https://github.com/apache/iceberg/pull/9771 >> >> > Was there any prior discussions on devlist for adding it to the spec? >> >> Jack, there is no separate discu

Re: New committer: Renjie Liu

2024-03-09 Thread Steven Wu
Congrats, Renjie! On Sat, Mar 9, 2024 at 7:18 AM himadri pal wrote: > Congratulations Renjie. > > Regards, > Himadri Pal > > > On Fri, Mar 8, 2024 at 11:56 PM Fokko Driesprong wrote: > >> Hi everyone, >> >> The Project Management Committee (PMC) for Apache Iceberg has invited >> Renjie Liu to b

Re: New committer: Bryan Keller

2024-03-05 Thread Steven Wu
Bryan, congratulations and thank you for your many contributions. On Tue, Mar 5, 2024 at 5:54 AM Bryan Keller wrote: > Thanks everyone! I really appreciate it, Iceberg has been inspiring to me, > both the project itself and the people involved, so I’m thankful to have > been given the opportunit

Re: Flink: uncommitted data files and garbage collection

2024-02-29 Thread Steven Wu
p. old uncommitted data files can still be garbage collected, which is the problem. On Thu, Feb 29, 2024 at 8:58 AM Steven Wu wrote: > We are probably off the topic of the original thread. I am moving the > Flink part of the discussion to a new thread/subject. > > > but the prepared

Flink: uncommitted data files and garbage collection

2024-02-29 Thread Steven Wu
We are probably off the topic of the original thread. I am moving the Flink part of the discussion to a new thread/subject. > but the prepared and not yet committed data files are also present in their final place. These data files are also not part of the table yet, and could be removed by the or

Re: [DISCUSS] spec: remove the file scan task JSON serialization section from table spec

2024-02-21 Thread Steven Wu
be safely removed since it is > not a contract that we are committed to maintaining. > > Ryan > > On Wed, Feb 21, 2024 at 1:30 PM Jack Ye wrote: > >> Was there any prior discussions on devlist for adding it to the spec? >> Could you help link those conversations?

[DISCUSS] spec: remove the file scan task JSON serialization section from table spec

2024-02-21 Thread Steven Wu
In the recent PR review [1], Ryan and emkornfield raised the question of why file scan task JSON serialization was added to the table spec [2]. We seem to have a consensus that it *shouldn't* have been added to the table spec. Now the question is what's the process of removing an invalid section f

Re: [Discuss] add a new task-type to file scan task JSON serialization

2024-02-16 Thread Steven Wu
ice, just saw that. > > We are adding the definitions as a part of https://github.com/apache/iceberg/pull/9695, we can help review the PRs listed > here and then update the OpenAPI spec accordingly. > > -Jack > > On Wed, Feb 14, 2024 at 4:12 PM Steven Wu wrote: > >> @

Re: [Discuss] add a new task-type to file scan task JSON serialization

2024-02-14 Thread Steven Wu
> table metadata in the REST spec? I'm actually pretty happy with OpenAPI for >> defining our JSON structures, so I think this would be easier in the REST >> spec. I would also consider an OpenAPI extension to the table spec for JSON >> objects since it is pretty easy to

Re: [Discuss] add a new task-type to file scan task JSON serialization

2024-02-14 Thread Steven Wu
The first linked reference is the PR for spec update. [3] https://github.com/apache/iceberg/pull/9728 On Wed, Feb 14, 2024 at 3:36 PM Steven Wu wrote: > We just ran out of time and didn't get a chance to discuss this in the > community sync meeting today. Hence, I am raising the

[Discuss] add a new task-type to file scan task JSON serialization

2024-02-14 Thread Steven Wu
We just ran out of time and didn't get a chance to discuss this in the community sync meeting today. Hence, I am raising the discussion here. We added JSON parsers for content file and file scan task a year ago [1]. Recently, I just realized the implementation only handles BaseFileScanTask. It wou

Re: flink-1.18 maven files are not available

2024-02-01 Thread Steven Wu
Not available yet. Flink 1.18 support is only added in the upcoming Iceberg 1.5.0 release. On Thu, Feb 1, 2024 at 8:50 AM Jacek Juraszek wrote: > Hi, > > I’ve found out that there is no > https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.18/ > . > Do you guys suppor

Re: [DISCUSS] Iceberg community summit

2024-01-16 Thread Steven Wu
Happy to volunteer for the selection committee too. Looking forward to a great event! On Tue, Jan 16, 2024 at 5:30 PM John Zhuge wrote: > Happy to volunteer and help out where I can. > > On Mon, Jan 15, 2024 at 1:46 AM Eduard Tudenhoefner > wrote: > >> I would also like to volunteer and help ou

Re: [ANNOUNCE] New committer: Honah J.

2024-01-14 Thread Steven Wu
Congrats, Honah! On Sun, Jan 14, 2024 at 8:06 PM Daniel Weeks wrote: > Congratulations Honah! Great work on pyiceberg. I'm really excited about > where that effort is headed. > > -Dan > > On Sun, Jan 14, 2024, 6:45 PM OpenInx wrote: > >> Congrats, Honah ! >> >> On Sun, Jan 14, 2024 at 1:25 AM

Re: [ANNOUNCE] Apache Iceberg release 1.4.1

2023-10-23 Thread Steven Wu
Thanks Eduard for driving the release! On Mon, Oct 23, 2023 at 8:35 AM Eduard Tudenhoefner < etudenhoef...@apache.org> wrote: > I'm pleased to announce the release of *Apache Iceberg 1.4.1*! > > Apache Iceberg is an open table format for huge analytic datasets. Iceberg > delivers high query perfo

Re: [DISCUSS] Apache Iceberg 1.4.1

2023-10-17 Thread Steven Wu
PR is merged https://github.com/apache/iceberg/pull/8848 On Mon, Oct 16, 2023 at 4:02 PM Steven Wu wrote: > Hi Eduard, > > I would like to include this change/revert for the 1.4.1 patch release. > https://github.com/apache/iceberg/issues/8847 > > PR is not created yet. i

Re: [DISCUSS] Apache Iceberg 1.4.1

2023-10-16 Thread Steven Wu
Hi Eduard, I would like to include this change/revert for the 1.4.1 patch release. https://github.com/apache/iceberg/issues/8847 PR is not created yet. it should be a very small change in one file. Thanks, Steven On Mon, Oct 16, 2023 at 6:50 AM Jean-Baptiste Onofré wrote: > Hi Eduard > > It s

Re: Scan column metrics

2023-10-07 Thread Steven Wu
such a large table to solve the problem, rather than filter them > out (which is actually not very quick anyway). Have you tried that? Why > doesn't it work? > > On Thu, Oct 5, 2023 at 7:57 PM Steven Wu wrote: > >> It is definitely good to only track column stats that are used.

Re: Scan column metrics

2023-10-05 Thread Steven Wu
d for operations like > this, right? Are you keeping metrics around for metadata agg pushdown or > something? > > I'm not opposed, but I do want to make sure there's not a simple way to > solve the problem. > > On Thu, Oct 5, 2023 at 3:23 PM Steven Wu wrote: > >

Re: Scan column metrics

2023-10-05 Thread Steven Wu
+1 for this feature of column stats projection. I will add some additional inputs. 1) In the previous discussion, there are comments on only enabling column stats that are needed. That is definitely a recommended best practice. But there are some practical challenges. By default, Iceberg enables

Re: Discussion about the location of language clients

2023-08-10 Thread Steven Wu
I am also on the side of separate repos for different languages. otherwise, the main repo can grow too big. iceberg.apache.org website can provide proper links to repos for different languages. I would be -1 on renaming apache/iceberg to apache/iceberg-java, as it can break external links to the m

Re: [DISCUSS] Flink range partitioner/shuffle

2023-06-05 Thread Steven Wu
fields work? > > I think you might want to make it simpler by just using a SortKey, like > you mentioned. > > On Sat, Jun 3, 2023 at 8:03 AM Steven Wu wrote: > >> Ryan, thanks a lot for the feedback. Will use `StructType` when >> applicable. >> >> `Partitio

Re: [DISCUSS] Flink range partitioner/shuffle

2023-06-03 Thread Steven Wu
that's basically > the same thing as `StructTransformation`, just with a different name. > > On Thu, Jun 1, 2023 at 3:31 AM Péter Váry > wrote: > >> Good point. >> Stick to the conventions then >> >> Steven Wu wrote (on Wed, May 31, 2023, >> 17:14): >

Re: Iceberg old partition gc

2023-06-02 Thread Steven Wu
> the main use case I had was table historical analysis (last update time for each partition, how many snapshots did this table ever have, for example), Partition level stats can probably help with questions like "last update time for each partition". @Szehon, I am wondering if we can create mat

Re: [DISCUSS] Flink range partitioner/shuffle

2023-05-31 Thread Steven Wu
chema for the `StructTransformation` object > instead, like `StructTransformation.schema()`? > > Steven Wu wrote (on Wed, May 31, 2023, > 7:19): > >> We are implementing a range partitioner for Flink sink shuffling [1]. One >> key piece is RowDataComparator for Flink Ro

[DISCUSS] Flink range partitioner/shuffle

2023-05-30 Thread Steven Wu
We are implementing a range partitioner for Flink sink shuffling [1]. One key piece is RowDataComparator for Flink RowData. Would love to get some feedback on a few decisions. 1. Comparators for Flink `RowData` type. Flink already has the `RowDataWrapper` class that can wrap a `RowData` as a `Stru

Re: [DISCUSS] Default format version for new tables?

2023-05-25 Thread Steven Wu
+1. Anton made a good case with the new perspective. On Thu, May 25, 2023 at 2:29 PM Anton Okolnychyi wrote: > Oh, I missed the earlier discussion. Thanks for sharing it, Gabor! > > I am approaching this from a slightly different perspective. Defaulting to > v2 does not mean supporting delete fi

Re: Scan statistics

2023-05-19 Thread Steven Wu
The proposal here is essentially column stats projection pushdown. For some Flink jobs with watermark alignment, the Flink source is only interested in the column stats (min-max) for one timestamp column. Hence the column stats projection can really help reduce the memory footprint for wide tables (with hu

Re: Improve Change Data Capture Use Case for Iceberg

2023-05-04 Thread Steven Wu
Thanks Jack for the great write-up. Good summary of the current landscape of CDC too. Left a few comments to discuss. On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi wrote: > Thanks for starting a thread, Jack! I am yet to go through the proposal. > > I recently came across a similar idea in B

Re: Welcome new committers and PMC!

2023-05-03 Thread Steven Wu
Congrats, Amogh, Eduard, and Szehon! Well deserved. Your contributions are much appreciated! On Wed, May 3, 2023 at 2:25 PM Yufei Gu wrote: > Congratulations, Amogh, Eduard and Szehon! Great job! > > Best, > > Yufei > > > On Wed, May 3, 2023 at 12:27 PM Russell Spitzer > wrote: > >> Great news!

Re: Sequence number for ContentFiles

2023-04-26 Thread Steven Wu
>> We also planned to expose file sequence number (different from data >> sequence number). I believe you could lookup snapshot using that info. >> >> https://iceberg.apache.org/spec/#manifest-entry-fields >> >> - Anton >> >> On Apr 26, 2023,

Re: Sequence number for ContentFiles

2023-04-26 Thread Steven Wu
piggyback on this thread since we are discussing exposing more metadata in ContentFile or FileScanTask. Flink source watermark alignment can potentially leverage the snapshot timestamp (when data files are committed/appended to the table). Is it reasonable to expose some snapshot metadata in the F

Re: [DISCUSS] Spark 3.1 support?

2023-04-21 Thread Steven Wu
- Anton >> >> On Apr 21, 2023, at 3:58 PM, Ryan Blue wrote: >> >> Good question about backports. Walaa and Edgar, are you backporting fixes >> to 3.1? It makes sense to have a place to collaborate, but only if people >> are actively keeping them updated. >>

Re: [DISCUSS] Spark 3.1 support?

2023-04-21 Thread Steven Wu
For the 3.1 activities that Ryan linked, 3.1 was probably updated because of the backporting requirement (keeping 3.1, 3.2, 3.3 in sync). It is the adopted policy. I am not sure it indicates that people are actively collaborating on 3.1. As Anton was saying, backporting/syncing 4 versions (3.1, 3
