[HiveCatalog] Skip updating column schema definition in HMS when schema string is longer than maxHiveTablePropertySize

2023-03-27 Thread Pucheng Yang
Hi community, We are using HiveCatalog (Iceberg 1.0 + Spark 3.2 + HMS 1.2.2 with MySQL as its DB) and we encountered a column type mismatch failure when trying to update the table. The root cause is when creating the Iceberg table with a column that has a super long column type string, mysql seem

Re: Why is sort required for Spark writing to partitioned table

2023-04-25 Thread Pucheng Yang
Hi to confirm, In the doc, https://iceberg.apache.org/docs/1.0.0/spark-writes/#writing-to-partitioned-tables, it says "Explicit sort is necessary because Spark doesn’t allow Iceberg to request a sort before writing as of Spark 3.0. SPARK-23889 is

Re: Why is sort required for Spark writing to partitioned table

2023-04-25 Thread Pucheng Yang
> in the table to request a distribution and ordering from Spark. Should be > supported both for batch and micro-batch writes. > > - Anton > > On Apr 25, 2023, at 11:05 AM, Pucheng Yang > wrote: > > Hi to confirm, > > In the doc, > https://iceberg.apache.org/docs/

Support create table like for Iceberg table?

2023-04-25 Thread Pucheng Yang
Hi all, I wonder how folks in the community deal with the cases where you want to create a test table from an existing iceberg table? In Hive, what we normally do is to run a query "create table x like y location z". But we can't do this for the Iceberg table. If this is a feature that is missing

Re: Support create table like for Iceberg table?

2023-04-25 Thread Pucheng Yang
in >> https://iceberg.apache.org/docs/latest/spark-ddl/#create-table >> >> Thanks, >> Steve Zhang >> >> >> >> On Apr 25, 2023, at 1:46 PM, Pucheng Yang >> wrote: >> >> Hi all, >> >> I wonder how folks in the community dea

Re: Support create table like for Iceberg table?

2023-04-27 Thread Pucheng Yang
re written and copy >> them. Look at Drop Table, maybe and see if you can copy the structure, but >> instead of dropping, load the table and call createTable with its metadata. >> >> On Tue, Apr 25, 2023 at 4:42 PM Pucheng Yang >> wrote: >> >>> Thanks Steve

Re: Support create table like for Iceberg table?

2023-04-27 Thread Pucheng Yang
>> think you can also specify explicit location as part of create statement in >>> https://iceberg.apache.org/docs/latest/spark-ddl/#create-table >>> >>> Thanks, >>> Steve Zhang >>> >>> >>> >>> On Apr 25, 2023, at 1:46 PM,

Re: [Proposal] Partition stats in Iceberg

2023-04-28 Thread Pucheng Yang
Hi Ajantha and the community, I am interested and I am wondering where we can see the latest progress of this feature? Regarding the partition stats in Iceberg, I am specifically curious if we can consider a new field called "last modified time" to be included for the partitions stats (or have a

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Pucheng Yang
ld be able to query the all_entries metadata table to see file >> additions or deletions for a given snapshot. Then from there you can join >> to the snapshots table for timestamps and aggregate to the partition level. >> >> Ryan >> >> On Fri, Apr 28, 2023 at 12:49
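The approach in the reply above (read file additions and deletions per snapshot from `all_entries`, join to `snapshots` for timestamps, aggregate per partition) can be sketched in plain Python. This is only an illustration under stated assumptions: the rows below are hypothetical, simplified stand-ins for the two metadata tables, and the function name `last_modified_by_partition` is made up for this sketch. The status codes (0 = EXISTING, 1 = ADDED, 2 = DELETED) follow the Iceberg manifest entry definition.

```python
# Hypothetical, simplified rows standing in for the all_entries metadata table.
# status: 0 = EXISTING, 1 = ADDED, 2 = DELETED (manifest entry status codes).
entries = [
    {"snapshot_id": 1, "status": 1, "partition": "dt=2023-04-01"},
    {"snapshot_id": 2, "status": 1, "partition": "dt=2023-04-02"},
    {"snapshot_id": 3, "status": 2, "partition": "dt=2023-04-01"},
    {"snapshot_id": 3, "status": 0, "partition": "dt=2023-04-02"},
]

# Hypothetical stand-in for the snapshots table: snapshot_id -> committed_at.
# ISO-8601 strings in one format compare correctly as plain strings.
committed_at = {
    1: "2023-04-01T10:00:00",
    2: "2023-04-02T10:00:00",
    3: "2023-04-03T10:00:00",
}


def last_modified_by_partition(entries, committed_at):
    """Return the latest commit time that added or deleted a file per partition."""
    latest = {}
    for entry in entries:
        if entry["status"] == 0:
            # EXISTING entries were carried over unchanged, so they do not
            # count as a modification of the partition.
            continue
        ts = committed_at[entry["snapshot_id"]]  # the "join" to snapshots
        partition = entry["partition"]
        if partition not in latest or ts > latest[partition]:
            latest[partition] = ts
    return latest


result = last_modified_by_partition(entries, committed_at)
```

In an engine such as Spark the same shape would be a join plus a `GROUP BY` on the partition column; the in-memory version is used here only to keep the sketch self-contained.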

Re: Support create table like for Iceberg table?

2023-05-09 Thread Pucheng Yang
good idea to > break the contract of CREATE TABLE LIKE. > > - Anton > > On Apr 27, 2023, at 11:59 AM, Pucheng Yang > wrote: > > Hi Anton, > > Yes, I want to branch the table state and reuse the data files, but for > test purposes only. Imagine if we want to test

Re: Support create table like for Iceberg table?

2023-05-09 Thread Pucheng Yang
9, 2023 at 10:00 AM Russell Spitzer wrote: > How would Create Table Like, be different than our "Snapshot" procedure, > just enabled for Iceberg Tables? Wondering if we should just expand that > functionality. > > On Tue, May 9, 2023 at 11:54 AM Pucheng Yang > wrote: >

Re: [Proposal] Partition stats in Iceberg

2023-05-15 Thread Pucheng Yang
> reconcile partition table and partition stats at some point though. Not > sure if it was designed/discussed yet, I think there was some thoughts on > short-circuiting Partitions table to read from Partition stats, if stats > exist for the current snapshot. > > > > Thanks >

Iceberg old partition gc

2023-05-31 Thread Pucheng Yang
Hi community, In my organization, a big portion of the datasets are partitioned by date; normally we keep the latest X date partitions for a given dataset. One issue that always bothers me is that if I want to delete a partition that should be garbage-collected, I run the SQL query "delete from tbl where dt = ..

Re: Iceberg old partition gc

2023-06-01 Thread Pucheng Yang
ave something specific in mind? > > Ryan > > On Wed, May 31, 2023 at 8:19 PM Pucheng Yang > wrote: > >> Hi community, >> >> In my organization, a big portion of the datasets are partitioned by >> date, normally we keep the latest X dates of partition for a give

How to remove an Iceberg partition that only contains parquet files with 0 record

2023-06-29 Thread Pucheng Yang
Iceberg version: 1.3.0 Spark version: 3.2.1 Hi community, I have an interesting situation where I migrated a Hive table to Iceberg and this original Hive table has a partition containing parquet files without any record. The delete statement can not get rid of this partition, any suggestion on ho

Re: How to remove an Iceberg partition that only contains parquet files with 0 record

2023-06-30 Thread Pucheng Yang
alter table identifier drop > partition(partition_col_name=partition_col_value) > > Pucheng Yang wrote on Fri, Jun 30, 2023 at 11:13: > >> Iceberg version: 1.3.0 >> Spark version: 3.2.1 >> >> Hi community, >> >> I have an interesting situation where I migrated a

Re: How to remove an Iceberg partition that only contains parquet files with 0 record

2023-06-30 Thread Pucheng Yang
Thanks Russell, will take a look today. On Fri, Jun 30, 2023 at 7:08 AM wrote: > You probably will need to manually delete the file entry using the table > api from Java > > Sent from my iPhone > > On Jun 30, 2023, at 6:58 AM, Pucheng Yang > wrote: > > > > Hi

Code review: [spark] skip empty file during table migration, table snapshotting or adding files

2023-07-11 Thread Pucheng Yang
Hi community, In a previous email, I asked about how to get rid of partitions that only contain empty files. Here I am proposing a PR https://github.com/apache/iceberg/pull/8040 (issue: https://github.com/apache/iceberg/issues/7949) to skip adding empty files during the migration, snapshotting or

Re: Code review: [spark] skip empty file during table migration, table snapshotting or adding files

2023-07-12 Thread Pucheng Yang
remove_empty_files procedure On Tue, Jul 11, 2023 at 10:07 AM Pucheng Yang wrote: > Hi community, > > In a previous email, I asked about how to get rid of partitions that only > contain empty files. > > Here I am proposing a PR https://github.com/apache/iceberg/pull/8040 > (issue: http

Cherrypick "delete" snapshot

2023-07-20 Thread Pucheng Yang
Hi community, I have a table that has the history below: null -> s1: overwrite (partition1) -> s2: overwrite (partition2) -> s3(current): delete (partition1). I want to undo the commit that generates s3 because it is a bad commit, and my goal is to have a history like below: null -> s1: overwri

Re: Cherrypick "delete" snapshot

2023-07-21 Thread Pucheng Yang
k the changes or if the delete would have > removed additional data files. > > To fix this, I think we just need to add the delete filter to the snapshot > so that we can re-run it to validate the result would be the same. Then we > can implement cherry-pick for delete operations. >

Make "compatibility.snapshot-id-inheritance.enabled" table prop default to True?

2023-08-31 Thread Pucheng Yang
Hi community, Table prop "compatibility.snapshot-id-inheritance.enabled" is introduced to avoid manifest rewrite if possible (PR: https://github.com/apache/iceberg/commit/c3dc9824b381e5e479e356be5e0f4fcf61a9fc37 ). During my recent investigation on a super long snapshot table creation on a huge t

Re: Make "compatibility.snapshot-id-inheritance.enabled" table prop default to True?

2023-08-31 Thread Pucheng Yang
rrectly, we still rewrite by default in v2 even though it's > safe. > > On Thu, Aug 31, 2023 at 9:40 AM Pucheng Yang > wrote: > >> Hi community, >> >> Table prop "compatibility.snapshot-id-inheritance.enabled" is introduced >> to a

Re: Make "compatibility.snapshot-id-inheritance.enabled" table prop default to True?

2023-08-31 Thread Pucheng Yang
Thanks Ryan, what might you consider an "older" version of Iceberg? Is it fair to say any version before https://github.com/apache/iceberg/commit/c3dc9824b381e5e479e356be5e0f4fcf61a9fc37 ? If that is the case, my organization controls the Iceberg reader so might be a less concern for me. Another o

Re: Make "compatibility.snapshot-id-inheritance.enabled" table prop default to True?

2023-08-31 Thread Pucheng Yang
y easy. > > And yes, versions of Iceberg older than the one where that config property > was added would be the ones where it is unsafe. It's probably safe for most > people, but we still can't change the default. > > On Thu, Aug 31, 2023 at 9:59 AM Pucheng Yang > wr

Re: [DISCUSS] Deprecate Spark 3.2 support?

2023-09-20 Thread Pucheng Yang
Like LinkedIn (mentioned in another thread), Pinterest is on Spark 3.2 and there is no immediate plan to upgrade to the new Spark version as the migration cost is very high and the process is slow. What will be the implications of not having the Spark 3.2 module in Iceberg anymore? Based on your past

Re: [DISCUSS] Deprecate Spark 3.2 support?

2023-09-20 Thread Pucheng Yang
s if there are serious > issues to fix. > > On Wed, Sep 20, 2023 at 4:35 PM Pucheng Yang > wrote: > >> Like Linkedin (mentioned in another thread), Pinterest is on Spark 3.2 >> and there is no immediate plan to upgrade to the new Spark version as the >> migration cos

Re: [DISCUSS] Deprecate Spark 3.2 support?

2023-09-20 Thread Pucheng Yang
On Wed, Sep 20, 2023 at 5:02 PM Pucheng Yang > wrote: > >> Got it, so "deprecate spark 3.2 support" does not mean removing the spark >> 3.2 module in the Iceberg project right? >> >> And it also means maybe some new changes will only be available on Spark

Re: Migration of PyIceberg to iceberg-python repository

2023-09-29 Thread Pucheng Yang
Thanks for doing this. I wonder how do we deal with all the issues filed for python module but still open in iceberg repo? On Fri, Sep 29, 2023 at 7:55 AM Eduard Tudenhoefner wrote: > +1 on moving to a separate repo and maintaining git history > > On Fri, Sep 29, 2023 at 3:30 PM Jean-Baptiste On

MOR CDC view support

2023-11-02 Thread Pucheng Yang
Hi community, I wonder if anyone is interested in having a MOR CDC view feature? My organization is interested in using Flink upsert (MOR) into the Iceberg table, but currently the MOR CDC view is not supported. If we were to support it, do you know how much work it will be? How difficult will th

Re: MOR CDC view support

2023-11-02 Thread Pucheng Yang
Feature request ticket: https://github.com/apache/iceberg/issues/8975 On Thu, Nov 2, 2023 at 9:16 PM Pucheng Yang wrote: > Hi community, > > I wonder if anyone is interested in having a MOR CDC view feature? My > organization is interested in using Flink upsert (MOR) into the Ice

Why manifest rewrite only touches files that have latest spec id?

2023-12-06 Thread Pucheng Yang
Hi community, May I know why manifest rewrite will only touch files that have the latest spec id? What will be the suggestion if we want to rewrite manifest files that belong to non current spec id? Manifest selection logic: https://github.com/apache/iceberg/blob/6a9d3c77977baff4295ee2dde0150d73c

Re: Why manifest rewrite only touches files that have latest spec id?

2023-12-06 Thread Pucheng Yang
Based on what I understand, it seems there is no particular reason, and seems like a feature to be added on. On Wed, Dec 6, 2023 at 8:06 PM Pucheng Yang wrote: > Hi community, > > May I know why manifest rewrite will only touch files that have the latest > spec id? What will be th

Re: Why manifest rewrite only touches files that have latest spec id?

2023-12-07 Thread Pucheng Yang
's > probably some confusion over how to select a spec since we don't like users > to need to work with IDs in the format directly. > > On Wed, Dec 6, 2023 at 10:28 PM Pucheng Yang > wrote: > >> Based on what I understand, it seems there is no particular reason, and >

Re: Why manifest rewrite only touches files that have latest spec id?

2023-12-07 Thread Pucheng Yang
n't rewrite files across different partition specs > because the manifest files themselves would have a different schema for the > partition tuple. That makes passing the data around a bit harder if you > want to use data frames. > > On Thu, Dec 7, 2023 at 1:12 PM Pucheng Yang >

Re: Apologies for multiple emails

2023-12-08 Thread Pucheng Yang
You can reply to one of the emails to redirect people to the other to avoid discussion in two places. On Fri, Dec 8, 2023 at 2:53 PM Drew wrote: > Sorry for all the emails! I had an issue with sending the email out the > other day with my proposal and it looks like the failed attempts ended up >

Re: [DISCUSS] PyIceberg 0.6.0 release

2024-01-26 Thread Pucheng Yang
I have similar questions as Yufei's. My organization has interest in Ray Iceberg integration and during the conversation with the Ray team, we know they would also like to have Iceberg integration as well. I think this is a good opportunity for both projects to collaborate. On Fri, Jan 26, 2024 a

Re: [VOTE] Release Apache PyIceberg 0.6.0rc1

2024-01-31 Thread Pucheng Yang
0.6.0 has been released already. Do you mean to release 0.6.1? On Wed, Jan 31, 2024 at 8:26 AM Sung Yun (BLOOMBERG/ 120 PARK) < syu...@bloomberg.net> wrote: > Hi Everyone, > > I propose that we release the following RC as the official PyIceberg 0.6.0 > release. > > A summary of the high level fea

Re: [VOTE] Release Apache PyIceberg 0.6.0rc1

2024-01-31 Thread Pucheng Yang
nvm, I was under the wrong impression it was released already. Thanks. On Wed, Jan 31, 2024 at 8:31 AM Pucheng Yang wrote: > 0.6.0 has been released already. Do you mean to release 0.6.1? > > On Wed, Jan 31, 2024 at 8:26 AM Sung Yun (BLOOMBERG/ 120 PARK) < > syu...@bloomberg.net>

Spark metadata deletion reliability issues of high memory usage cause s3 client failures, potentially due to cached high volume of manifests

2024-04-18 Thread Pucheng Yang
Hi community, We are seeing Spark Iceberg table metadata deletion consuming very high driver memory and seemingly causing S3 client failures. I would like to present my findings and seek comments from the community: - My table is a v1 table with 3k manifests, each manifest is around 20-30 MB so in to

Re: Proposal for REST APIs for Iceberg table scans

2024-05-18 Thread Pucheng Yang
Hi all, I wonder if we have an ETA for this change? thanks On Wed, Jan 31, 2024 at 10:30 AM Chertara, Rahil wrote: > Sure, I can look into adding this to the spec. > Thanks to everyone for sharing their thoughts, appreciate it! > > > > *From: *Ryan Blue > *Reply-To: *"dev@iceberg.apache.org" >

Re: Pagination for List APIs in the REST spec

2024-05-18 Thread Pucheng Yang
Hi all, is there an ETA for this? thanks On Wed, Dec 20, 2023 at 6:03 PM Renjie Liu wrote: > I think if servers provide a meaningful error message on expiration >> hopefully, this would be a good first step in debugging. I think saying >> tokens should generally support O(Minutes) at least shou

Re: Pagination for List APIs in the REST spec

2024-05-20 Thread Pucheng Yang
You are right, thanks Jack. On Mon, May 20, 2024 at 8:06 AM Jack Ye wrote: > I believe this is already merged? > https://github.com/apache/iceberg/pull/9782 > > Best, > Jack Ye > > On Sat, May 18, 2024 at 4:06 PM Pucheng Yang > wrote: > >> Hi all, is there an

Iceberg Summit Video not accessible from non-registered users

2024-05-23 Thread Pucheng Yang
Hi, My co-workers who did not register for the iceberg summit would like to check the videos. However, it seems the registration is closed hence they are not able to access the videos. Is there a way to release the video in a way non-registered users can view it? Thanks Pucheng

Proposal to support cherrypick static overwrite

2024-05-28 Thread Pucheng Yang
Hi community, My client is looking for the support of cherrypick static partition overwrite. Based on my understanding, the reason we can not do it is because we do not preserve static overwrite filters. I would like to make a proposal to support cherrypick static overwrite: 1. We will allow user

Re: Proposal to support cherrypick static overwrite

2024-05-30 Thread Pucheng Yang
Hi community, I would like to follow up on this proposal and would like to check if anyone has concerns about the proposed implementation from a high level perspective? Thank you very much Best, Pucheng On Tue, May 28, 2024 at 9:02 PM Pucheng Yang wrote: > Hi community, > > My

Re: Proposal to support cherrypick static overwrite

2024-05-30 Thread Pucheng Yang
> Pucheng, > > I am not sure about others. At least I had some hard time understanding > what the problem/proposal is. What is "cherrypick static partition > overwrite"? > > Thanks, > Steven > > On Thu, May 30, 2024 at 11:59 AM Pucheng Yang > wrote: > >

Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-04 Thread Pucheng Yang
Hi all, regarding the "big metadata" issue, my understanding is even for Plan/Preplan API in the task planning use case, it will still have the same issue when the engine is doing a full table scan for large tables. Is my understanding correct? Also, given metadata compute could be heavy, do we co

Re: Spark: Copy Table Action

2024-07-08 Thread Pucheng Yang
Thanks for picking this up, I think this is a very valuable addition. On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu wrote: > Hi folks, > > I'd like to share a recent progress of adding actions to copy tables > across different places. > > There is a constant need to copy tables across different place

Re: Building with JDK 21

2024-07-09 Thread Pucheng Yang
What does dropping Java 8 support mean to companies that are still using Java 8 for Iceberg in production? On Tue, Jul 9, 2024 at 9:26 AM Ryan Blue wrote: > +1 for removing Java 8 support. > > On Tue, Jul 9, 2024 at 9:24 AM Russell Spitzer > wrote: > >> The different formatting preferences soun

Re: Building with JDK 21

2024-07-09 Thread Pucheng Yang
a versions. > And as this thread shows, supporting old Java versions and new Java > versions at the same time becomes challenging. > Do you maybe know how big of the impact on community would dropping Java 8 > have? Some estimate on percentage of install base? > > Best, > Piotr >

Re: Spark: Copy Table Action

2024-07-11 Thread Pucheng Yang
e path debate: >>> - I have seen the relative path requirement coming up multiple times in >>> the past. Seems like a feature requested by multiple users, so I think it >>> would be the best to discuss it in a different thread. The Copy Table >>> Action might be

Java String to Expression Util?

2024-07-25 Thread Pucheng Yang
Hi dev community, If I read the codebase correctly, there seems to be no utility for converting a String to an Expression. Would anyone be interested in having one? I can help if the community thinks this is a good addition, though I will need some guidance on how to get this started. Thanks

Re: Java String to Expression Util?

2024-07-25 Thread Pucheng Yang
bf684f258f9/core/src/main/java/org/apache/iceberg/expressions/ExpressionParser.java#L262-L264 > > > Thanks, > Steve Zhang > > > > On Jul 25, 2024, at 11:54 AM, Pucheng Yang > wrote: > > Hi dev community, > > If I read the codebase correctly, there seems to be n

Re: Java String to Expression Util?

2024-07-25 Thread Pucheng Yang
The motivation is that users usually need to define what partitions/ files to select in some DSL, where the filter is expressed as a form of the string, but the underlying engine is Java based. On Thu, Jul 25, 2024 at 2:07 PM Pucheng Yang wrote: > Hi Steve, > > Thanks for sharing, unfo

Re: Java String to Expression Util?

2024-07-25 Thread Pucheng Yang
They may leverage Calcite for that. >> >> On Thu, Jul 25, 2024 at 2:16 PM Pucheng Yang >> wrote: >> >>> The motivation is that users usually need to define what partitions/ >>> files to select in some DSL, where the filter is expressed as a form of

[Feature Proposal] metadata location provider

2022-08-08 Thread Pucheng Yang
Hi all, I wonder how you like the idea of having a MetadataLocationProvider (similar to LocationProvider for data files)? This is the ticket https://github.com/apache/iceberg/issues/4583 Our use case is about storing users' Iceberg table metadata in a pre-designated location. We know that there i

Reverting a commit in the table history?

2022-09-29 Thread Pucheng Yang
Hi all, I wonder if any discussion happened about the idea of reverting a commit in the table history? My clients have such a use case: they are writing some data into a partition, and later want to revert that. But since new snapshots have been generated, they cannot use snapshot rollback.

Re: Reverting a commit in the table history?

2022-09-29 Thread Pucheng Yang
on conflict and then reapply the changes. > The logic seems to be similar as what you want, to rollback to that > specific snapshot and try to reapply the ones you still want. > > > > Best, > > Jack Ye > > > > *From: *Pucheng Yang > *Reply-To: *"dev@icebe

Re: Reverting a commit in the table history?

2022-09-30 Thread Pucheng Yang
changes from the commit and > reverse them. We'd want to start small because reverting the file-level > changes isn't always the same thing as reverting the semantic changes. But > for simple cases like an append commit, it would work just fine. > > Ryan > > On Th

Re: Reverting a commit in the table history?

2022-09-30 Thread Pucheng Yang
C and if the row was in A or B then it > would bring back a deleted row. > > For this, we probably also need to know the original filter so that we can > check for certain conflicts. Right now, that’s not stored anywhere. But we > could start adding it to Snapshot metadata. > >

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Pucheng Yang
d. > (Particularly the Spark procedure rewrite_position_deletes and selecting > _partition metadata column). It is possible there's other places not fixed > yet. Do you have concrete examples of broken functionality on the latest > code? > > Thanks > Szehon > > On

Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Pucheng Yang
Hey community, I was following https://github.com/apache/iceberg/issues/9220 (Max number of columns), went down the rabbit hole, and found there are a lot of discussions about issues with tables having more than 1k columns. However, after reviewing discussions, it is still a little confusing to me

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Pucheng Yang
Thank you very much! I will try to document this on the website. On Thu, Oct 31, 2024 at 6:06 PM Szehon Ho wrote: > Yes, that is correct! > > Thanks > Szehon > > On Thu, Oct 31, 2024 at 5:58 PM Pucheng Yang > wrote: > >> Hi Szehon, >> >> Thanks for

Re: [DISCUSS] Enforce table properties at catalog level

2024-11-27 Thread Pucheng Yang
I think the naming of the property should be fixed as it only applies for any new table creation. On Wed, Nov 27, 2024 at 2:21 AM Manu Zhang wrote: > Hi all, > > Currently, we can *enforce default table properties* at catalog level > with configs like > spark.sql.catalog.*catalog-name*.table-ove

Re: [Iceberg Summit 25] Registration is live!

2025-02-03 Thread Pucheng Yang
Hi all, I am looking forward to the summit. Quick question, there is a promo code section in the registration, and I wonder if there is any promotion code we can use to further reduce fees? Best Regards, Pucheng On Wed, Jan 29, 2025 at 12:40 PM Danica Fine wrote: > Hi Kevin, > > We'll likely h

Re: Spark: Copy Table Action

2025-02-20 Thread Pucheng Yang
ch where the scheme and cluster is >> minted by the catalog, to be used in the respective FileIO implementation >> for the blob stores. For example, if we had a bucket foo on us-east, and >> bucket bar on us-west, the catalog running on us-east would mint s3://foo, >> and

Re: Restrict orphan file removal to data/metadata directories

2025-02-26 Thread Pucheng Yang
Yes, the Iceberg spec does not define where the data and metadata should be located. /data and /metadata are the default paths, but users can override this behavior by providing a custom location provider or by setting write.metadata.path explicitly. On Wed, Feb 26, 2025 at 1:24 PM karuppayya wrote: > Hello
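To make the point in the message above concrete, here is a minimal Python sketch of how a metadata or data directory might be resolved from table properties. It is an illustration, not Iceberg's actual implementation: the helper names `metadata_dir` and `data_dir` are hypothetical, a custom location provider is not modeled, and only the `write.metadata.path` and `write.data.path` properties with their `<table location>/metadata` and `<table location>/data` defaults are assumed.

```python
def metadata_dir(table_location, properties):
    """Resolve the metadata directory (hypothetical helper, not an Iceberg API)."""
    override = properties.get("write.metadata.path")
    if override:
        return override.rstrip("/")
    # Default: a "metadata" folder under the table location.
    return table_location.rstrip("/") + "/metadata"


def data_dir(table_location, properties):
    """Resolve the data directory (hypothetical helper, not an Iceberg API)."""
    override = properties.get("write.data.path")
    if override:
        return override.rstrip("/")
    # Default: a "data" folder under the table location.
    return table_location.rstrip("/") + "/data"


# Hypothetical table location and bucket names, for illustration only.
default_meta = metadata_dir("s3://bucket/db/tbl", {})
custom_meta = metadata_dir(
    "s3://bucket/db/tbl", {"write.metadata.path": "s3://meta-bucket/tbl/"}
)
default_data = data_dir("s3://bucket/db/tbl/", {})
```

The second call shows why an orphan-file cleanup that only looks under `<table location>/metadata` could miss files: with an override, the metadata lives outside the table location entirely.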