Re: [ANNOUNCE] Welcome Prashant Singh as a new Apache Iceberg Committer

2025-07-22 Thread Szehon Ho
Congratulations Prashant! Thanks for all the contributions! Szehon On Tue, Jul 22, 2025 at 10:29 AM Steve wrote: > Congratulations, Prashant! Great work! > > > On Tue, Jul 22, 2025 at 10:28 AM Kevin Liu wrote: > >> Congratulations, Prashant!! Well deserved. Thanks for all the >> contributions.

Re: Iceberg 1.10.0 release update - July 1, 2025

2025-07-03 Thread Szehon Ho
ady approved it. Can we get a couple more approvals? On Thu, Jul 3, 2025 at 1:37 PM Szehon Ho <szehon.apa...@gmail.com> wrote: Hi Steven Thanks. One more, what do we think about having https://github.com/apache/iceberg/pull/13106/ as part of 1.10 release? It's migrating Spark procedure to use

Re: Iceberg 1.10.0 release update - July 1, 2025

2025-07-03 Thread Szehon Ho
. Treating the long > literal always as micro (current behavior) is not a correctness bug. It is > an important issue to be fixed so that engines can support nano timestamp > literal without going through the hoops. > > Thanks, > Steven > > On Wed, Jul 2, 2025 at 11:53 AM Szeh

Re: Iceberg 1.10.0 release update - July 1, 2025

2025-07-02 Thread Szehon Ho
Thanks Steven for driving the release. I'd like to get in one more bug fix: https://github.com/apache/iceberg/pull/13448, it is a backport of https://github.com/apache/iceberg/pull/13435 (merged by Amogh) as I missed doing it for Spark 3.4, so it should also be close. Thanks Szehon On Wed, Jul 2, 2025 at

Re: [DISCUSS] v4 - Improved column statistics

2025-06-02 Thread Szehon Ho
+1, excited for this one too. We've seen the current metrics maps blow up memory and hope this can improve that. On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell Ids. Thanks Szehon On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfne

Re: [DISCUSS] v4 - One file commits

2025-05-29 Thread Szehon Ho
Look forward to when Iceberg can move on a bit from its name, to handle slightly faster data. Interested as well to follow along, if I can! Do we plan to store these files in columnar format as well? > Is that the other thread? https://lists.apache.org/thread/phdo75zmt8j9r44ngd7vdhtxqq63yxsp Tha

Re: [VOTE] Adopt the v3 spec changes

2025-05-19 Thread Szehon Ho
+1 (binding) Thanks, it's an exciting step for Iceberg! Szehon On Mon, May 19, 2025 at 4:03 PM Jia Yu wrote: > This is exciting! > > +1 (non-binding) > > On Mon, May 19, 2025 at 3:27 PM Ryan Blue wrote: > >> +1 (binding) >> >> I’ve gone through the changes in detail and I’m confident that they

Re: [VOTE] Clarify writer requirements in the spec to prevent orphan DVs

2025-05-14 Thread Szehon Ho
+1 (binding) Thanks Szehon On Wed, May 14, 2025 at 10:15 AM Fokko Driesprong wrote: > +1 (binding) > > Op wo 14 mei 2025 om 19:14 schreef Amogh Jahagirdar <2am...@gmail.com>: > >> +1 (binding) >> >> On Wed, May 14, 2025 at 11:09 AM Ryan Blue wrote: >> >>> +1 (binding) >>> >>> Thanks, Anton! >>

Re: [VOTE] Merge details about GZip metadata files to the spec.

2025-05-12 Thread Szehon Ho
+1 (binding) Thanks Szehon On Mon, May 12, 2025 at 9:19 AM Russell Spitzer wrote: > +1 (binding) > > On Mon, May 12, 2025 at 5:32 AM Eduard Tudenhöfner < > etudenhoef...@apache.org> wrote: > >> +1 (binding) >> >> On Mon, May 12, 2025 at 3:45 AM Gang Wu wrote: >> >>> +1 (non-binding) >>> >>> On

Re: [VOTE] Minor clarification for Geo Spec

2025-05-09 Thread Szehon Ho
> wrote: >>>>>> >>>>>>> +1 (non-binding) >>>>>>> >>>>>>> On May 6, 2025, at 2:53 PM, Ryan Blue wrote: >>>>>>> >>>>>>> +1 (binding) >>>>>>> >>>

Re: [VOTE] Minor clarification for Geo Spec

2025-05-09 Thread Szehon Ho
The vote passes with 7 binding +1's and 4 non-binding +1's. Thanks everyone for voting! Szehon On Fri, May 9, 2025 at 3:26 PM Szehon Ho wrote: > +1 (binding) > > Thanks > Szehon > > On Wed, May 7, 2025 at 10:41 AM huaxin gao wrote: > >> +1 (non-binding)

[VOTE] Minor clarification for Geo Spec

2025-05-06 Thread Szehon Ho
Hi everyone, As discussed briefly in https://lists.apache.org/thread/ncj0xjh2ct5xvovn4tzc45lkm1wbmorq, there is a minor clarification for geo type bounds that we want to get in for finalizing V3 spec. We want to clarify the behavior of null/NaN coordinate values in geo objects. There can be many

Re: [DISCUSS] Finalizing the v3 spec

2025-04-30 Thread Szehon Ho
Hi Jia I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as well, once we finalize that. Thanks Szehon On Tue, Apr 29, 2025 at 10:55 PM Jia Yu wrote: > Hi Szehon, > > Thanks for clarifying it. > > We’re curren

Re: [VOTE] Add encryption keys to table metadata

2025-04-30 Thread Szehon Ho
+1 Thanks Szehon On Wed, Apr 30, 2025 at 4:10 AM Eduard Tudenhöfner wrote: > +1 (binding) > > On Tue, Apr 29, 2025 at 9:29 PM Ryan Blue wrote: > >> Hi everyone, >> >> I’d like to propose merging PR 12162 >> into the table spec >> for v3. The

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Szehon Ho
Hi Jia I think it's about the spec, and not the implementation (which is definitely good for reducing the risk of needing to change the spec). We actually wanted to get our Parquet reader/writer out for this effort, but as we see, it seems it depends on the next Parquet-java release for the new Geo types on Par

Re: [VOTE] Update row lineage spec ID assignment

2025-04-17 Thread Szehon Ho
+1 (binding) Seems cleaner to me. Thanks Szehon On Thu, Apr 17, 2025 at 10:31 AM Russell Spitzer wrote: > +1 > > On Thu, Apr 17, 2025 at 12:30 PM Ryan Blue wrote: > >> Adding my own +1. >> >> On Thu, Apr 17, 2025 at 10:19 AM Daniel Weeks wrote: >> >>> +1 (binding) >>> >>> I think this update

Re: [VOTE] Simplify multi-argument field-id(s) encoding

2025-04-17 Thread Szehon Ho
+1 (binding) Thanks Szehon On Thu, Apr 17, 2025 at 10:18 AM Daniel Weeks wrote: > +1 (binding) > > On Thu, Apr 17, 2025 at 8:41 AM Russell Spitzer > wrote: > >> +1 (Bind) >> >> On Thu, Apr 17, 2025 at 8:14 AM Jean-Baptiste Onofré >> wrote: >> >>> +1 (non binding) (as said in the PR :)) >>> >>

Re: [VOTE] Row lineage required for v3

2025-03-31 Thread Szehon Ho
+1 (binding) Thanks Szehon On Mon, Mar 31, 2025 at 8:53 PM huaxin gao wrote: > +1 (non-binding) > > On Mon, Mar 31, 2025 at 7:44 PM Renjie Liu > wrote: > >> +1 >> >> On Tue, Apr 1, 2025 at 10:33 AM Denny Lee wrote: >> >>> +1 (non-binding) >>> >>> On Mon, Mar 31, 2025 at 7:27 PM roryqi wrote:

Re: [VOTE] Minor simplifications for Geo Spec

2025-03-23 Thread Szehon Ho
) Therefore, the vote passes. Szehon On Sun, Mar 23, 2025 at 5:47 PM Szehon Ho wrote: > +1 > > On Sat, Mar 22, 2025 at 10:42 PM huaxin gao > wrote: > >> +1 (non-binding) >> >> On Sat, Mar 22, 2025 at 6:32 PM Prashant Singh >> wrote: >> >>> +1 (non b

Re: [VOTE] Minor simplifications for Geo Spec

2025-03-23 Thread Szehon Ho
+1 On Sat, Mar 22, 2025 at 10:42 PM huaxin gao wrote: > +1 (non-binding) > > On Sat, Mar 22, 2025 at 6:32 PM Prashant Singh > wrote: > >> +1 (non binding) >> >> Best, >> Prashant >> >> On Fri, Mar 21, 2025 at 10:03 AM Russell Spitzer < >> russell.spit...@gmail.com> wrote: >> >>> +1 (bind >>> >>

[VOTE] Minor simplifications for Geo Spec

2025-03-18 Thread Szehon Ho
Hi everyone, While working on the reference implementation for Geometry/Geography spec, we noticed some parts that can be simplified for this first version: 1. Default values should always be null (requires WKT serialization logic, for not many real world use cases) 2. JSON type serializ

Re: Restrict orphan file removal to data/metadata directories

2025-03-05 Thread Szehon Ho
Hi Karuppayya, I wanted to check: would a regex suffice for this use case (i.e., match /data/*, /metadata/*), to keep it more general? The idea came from Dan in a one-off chat. Thanks Szehon On Wed, Feb 26, 2025 at 1:41 PM Pucheng Yang wrote: > Yes, Iceberg spec does not define where the data
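
The regex idea in the message above (only treat files under /data/ and /metadata/ as orphan-removal candidates) could look roughly like the following. This is a minimal Python sketch, not Iceberg code: the helper names, the bucket paths, and the assumption that both directories sit directly under the table location are all hypothetical.

```python
import re


def build_candidate_matcher(table_location):
    """Compile a pattern for files under <location>/data/ or <location>/metadata/.

    Hypothetical helper for illustration; not part of any Iceberg API.
    """
    root = re.escape(table_location.rstrip("/"))
    return re.compile(rf"^{root}/(?:data|metadata)/.+$")


def filter_candidates(paths, table_location):
    """Keep only the paths that fall under the data/ or metadata/ directories."""
    matcher = build_candidate_matcher(table_location)
    return [p for p in paths if matcher.match(p)]


# Made-up example paths; only the first two are under data/ or metadata/.
paths = [
    "s3://bucket/db/tbl/data/part-0.parquet",
    "s3://bucket/db/tbl/metadata/v1.metadata.json",
    "s3://bucket/db/tbl/other/notes.txt",
]
candidates = filter_candidates(paths, "s3://bucket/db/tbl")
```

A pattern like this stays general because the caller could swap in any other expression without changing the filtering step.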

Re: [VOTE] Java implementation notes around current-snapshot-id

2025-02-24 Thread Szehon Ho
+1 Thanks Szehon On Mon, Feb 24, 2025 at 2:52 PM rdb...@gmail.com wrote: > +1 > > On Mon, Feb 24, 2025 at 12:26 PM Daniel Weeks wrote: > >> +1 >> >> On Mon, Feb 24, 2025, 11:00 AM Russell Spitzer >> wrote: >> >>> +1 >>> >>> On Mon, Feb 24, 2025 at 12:55 PM Fokko Driesprong >>> wrote: >>> >>>

Re: Spark: Copy Table Action

2025-02-20 Thread Szehon Ho
Hi Thanks to Steve Zhang, we have a doc now of how to use RewriteTablePaths as part of table replication (hot off the nightly doc build): https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication. You can use it like: - RegisterTable, returns CopyPlan and lastVersionFileN

Re: [VOTE] Add overwriteRequested to RegisterTableRequest in REST spec

2025-02-13 Thread Szehon Ho
+1 Thanks Steve! Szehon On Thu, Feb 13, 2025 at 1:23 PM Yufei Gu wrote: > +1 (binding) > Yufei > > > On Thu, Feb 13, 2025 at 1:20 PM huaxin gao wrote: > >> +1 (non-binding) >> >> On Thu, Feb 13, 2025 at 11:51 AM Anurag Mantripragada >> wrote: >> >>> +1 (non-binding) >>> >>> Thanks, Steve! >>>

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-10 Thread Szehon Ho
at 11:42 AM Szehon Ho wrote: > Here is my +1 (binding) > > Thanks > Szehon > > On Mon, Feb 10, 2025 at 12:47 AM Eduard Tudenhöfner < > etudenhoef...@apache.org> wrote: > >> +1 >> >> On Sat, Feb 8, 2025 at 1:02 PM Fokko Driesprong wrote: >>

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-10 Thread Szehon Ho
>>>> +1 >>>>>> >>>>>> Best regards, >>>>>> Honah >>>>>> >>>>>> On Fri, Feb 7, 2025 at 10:45 AM Aihua Xu wrote: >>>>>> >>>>>>> +1 (non-bi

Re: [VOTE] Simplify multi-arg table metadata

2025-02-09 Thread Szehon Ho
+1 (binding) Thanks Fokko! Szehon > On Feb 9, 2025, at 8:14 AM, Jean-Baptiste Onofré wrote: > > +1 (non binding) > > Thanks to the cat :) > > Regards > JB > >> On Sun, Feb 9, 2025 at 10:01 AM Fokko Driesprong wrote: >> >> (Second attempt, the cat ran over the keyboard) >> >> Hey everyone

Re: [DISCUSS] Simplify multi-arg table metadata

2025-02-08 Thread Szehon Ho
Missed the thread, but with v3 now much closer than before, agree that the gain is not worth the risk. Thanks! Szehon On Fri, Feb 7, 2025 at 11:52 PM Xianjin Ye wrote: > +1. I think it's good timing to allow multi-arg transform for V3 and > onwards only. > > On 2025/02/03 18:26:00 "Driesprong,

[VOTE] Add Geometry and Geography types for V3

2025-02-06 Thread Szehon Ho
Hi everyone We would like to add Geometry and Geography types to the Iceberg V3 spec: https://github.com/apache/iceberg/pull/10981 This is proposed together with Apache Parquet format change to support geospatial data. https://github.com/apache/parquet-format/pull/240 This vote will be open fo

Welcome Huaxin Gao as a committer!

2025-02-06 Thread Szehon Ho
Hi everyone, The Project Management Committee (PMC) for Apache Iceberg has invited Huaxin Gao to become a committer, and I am happy to announce that she has accepted. Huaxin has done a lot of impressive work in areas such as Iceberg-Spark integration and recently Iceberg-Comet integrations. Thank

Re: [VOTE] Document Snapshot Summary Optional Fields as Subsection of Appendix F in Spec

2025-01-21 Thread Szehon Ho
+1 (binding) Thanks Szehon On Tue, Jan 21, 2025 at 12:55 PM Yufei Gu wrote: > +1 Thanks Honah! > > Yufei > > > On Tue, Jan 21, 2025 at 12:45 PM Russell Spitzer < > russell.spit...@gmail.com> wrote: > >> +1 >> >> On Tue, Jan 21, 2025 at 2:36 PM rdb...@gmail.com >> wrote: >> >>> +1 >>> >>> On Tu

Re: New committer: Matt Topol

2024-12-10 Thread Szehon Ho
Congratulations, Matt! Thanks for the perseverance on Iceberg-Go Szehon On Tue, Dec 10, 2024 at 10:02 AM Bryan Keller wrote: > Congrats! > > -Bryan > > On Dec 10, 2024, at 7:37 AM, Matt Topol wrote: > > Thanks everyone! > > On Tue, Dec 10, 2024 at 9:26 AM Gang Wu wrote: > >> Congrats Matt! >>

Re: [Discuss] Geospatial Support

2024-12-06 Thread Szehon Ho
itzer < >> russell.spit...@gmail.com> wrote: >> >>> All my concerns are addressed, I'm ready to vote. >>> >>> On Mon, Sep 30, 2024 at 1:21 PM Szehon Ho >>> wrote: >>> >>>> Hi all, >>>> >>>> There have

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho
> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho > wrote: > >> Hm I think the thread got a bit sidetracked by the other question. >> >> The initial proposal by Steve is a performance improvement for >> HiveCatalog's tableExists(). Currently it loads both Hive a

Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-11-27 Thread Szehon Ho
This makes sense to me generally, I've tried a few times to search in the spec to find a list of possible snapshot summary properties, and was a bit surprised to not find them there. So I think this would be a nice addition. I'm curious if there's any historical reason it's not been included in t

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho
> existing behavior for this part >>>>> >>>>> +1. I realized that this is not a new behavior. The `loadTable` >>>>> implementation has this problem too. >>>>> It would be good to have a test case specifically for this edge case >>&

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
Should add, my personal preference is probably not to change the existing behavior for this part (false, if exists a Hive table with same name) at the moment, just adding another possibility for consideration. Thanks Szehon On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho wrote: > Thanks Kevin

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
>> because `HiveOperationsBase.validateTableIsIceberg` throws a >> `NoSuchTableException`. >> This would cause the code above to attempt to create the table, only to >> fail since the name already exists in the HMS. >> If `tableExists` is meant to check for conflicti

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-21 Thread Szehon Ho
Hi, It's a good performance find and improvement. Left some comments on the PR. IMO, the behavior actually matches the API javadoc ("Check whether table exists") more closely, rather than whether it is corrupted or not, so I'm supportive of it. Thanks Szehon On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang wrote

Re: [DISCUSS] Deprecate embedded manifests

2024-11-21 Thread Szehon Ho
+1, great to have less possible paths. Thanks Szehon On Thu, Nov 21, 2024 at 10:33 AM Steve Zhang wrote: > +1 to deprecate > > Thanks, > Steve Zhang > > > > On Nov 19, 2024, at 3:32 AM, Fokko Driesprong wrote: > > Hi everyone, > > I would like to propose to deprecate embedded manifests >

Re: [DISCUSS] Partial Metadata Loading

2024-11-05 Thread Szehon Ho
There seem to be many opinions here, but one of the main objections seems to be the complexity added to the REST spec impeding newer catalogs. Looking through the actual REST API change proposal

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Szehon Ho
le issues but they are generally fixable. > > Is my understanding correct? > > Thanks again for your quick response. > > On Thu, Oct 31, 2024 at 5:50 PM Szehon Ho wrote: > >> Hi Pucheng >> >> There were some parts in the implementation where column field ids

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Szehon Ho
Hi Pucheng There were some parts in the implementation where column field ids collided with partition field ids. https://github.com/apache/iceberg/pull/10020 introduced mechanisms for affected code to get unique ids, and known places have been fixed. (Particularly the Spark procedure rewrite_posit

Re: [VOTE] Deletion Vectors in V3

2024-10-30 Thread Szehon Ho
-0 Great work and exciting functionality, but transferring concerns from the other thread about the decision. Thanks Szehon On Wed, Oct 30, 2024 at 9:12 AM Steven Wu wrote: > +1 > > On Wed, Oct 30, 2024 at 1:07 AM xianjin wrote: > >> +1 (non binding) >> >> On Wed, Oct 30, 2024 at 2:28 PM Jean

Re: [Discuss] Iceberg View Interoperability

2024-10-25 Thread Szehon Ho
he limitations > with the SQL representation (like not wanting to rewrite after resolution). > > On Fri, Oct 25, 2024 at 10:21 AM Szehon Ho > wrote: > >> I don't have hands-on experience with Substrait, but wondering, is >> Substrait representation possible today with existi

Re: [Discuss] Iceberg View Interoperability

2024-10-25 Thread Szehon Ho
I don't have hands-on experience with Substrait, but I'm wondering: is a Substrait representation possible today with the existing Iceberg view spec? I.e., can engines today store the text-serialized Substrait representation with SQL dialect 'substrait'? Or is it an abuse of the spec and we should make a proper f

Re: [PROPOSAL] Add manifest-level statistics for CBO estimation

2024-10-24 Thread Szehon Ho
Hi, I'm just wondering: would one solution be to put these stats in Puffin files? There's already ComputeTableStatsSparkAction (and probably similar actions in other engines), and I can imagine a quick metadata aggregation job to compute min/max/null_values, etc. Also, how accurate would we need the stats to be? T

Re: Spec changes for deletion vectors

2024-10-21 Thread Szehon Ho
um" for each option. Because we want to >>> be able to seek directly to the DV for a particular data file, I think it's >>> important to start the blob with magic bytes. That way the reader can >>> validate that the offset was correct and that the contents of the

Re: Spec changes for deletion vectors

2024-10-17 Thread Szehon Ho
rds compatibility by adding "reader" support for Delta Lake >>>>> DVs >>>>> in the spec, but not "writer support". >>>>> >>> d. Go forward with the current proposal but use offset and length >>>>

Re: Spec changes for deletion vectors

2024-10-15 Thread Szehon Ho
This is awesome work by Anton and Ryan. It looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient; it has been a long time coming and I'm truly excited to see the great improvement in storage/perf. With respect to these fields, I think most of the concerns are already me

Re: [Discuss] Geospatial Support

2024-09-30 Thread Szehon Ho
and team are also volunteering to work on the prototype immediately afterwards. Thank you, Szehon On Tue, Aug 20, 2024 at 1:57 PM Szehon Ho wrote: > Hi all > > Please take a look at the proposed spec change to support Geo type for V3 > in : https://github.com/apache/iceberg/pul

Re: [DISCUSS] Action to Rewrite Equality Deletes as Position Deletes

2024-09-13 Thread Szehon Ho
+1, I'd be happy to see this feature. Thanks Szehon On Fri, Sep 13, 2024 at 10:33 AM Prashant Singh wrote: > Hi All, > > Starting this thread to revive the discussion on converting Equality > Deletes as Position deletes and see if this is something community wants > now (Happy to contribute in t

Re: [Discuss] Geospatial Support

2024-08-20 Thread Szehon Ho
). Thanks, Szehon On Wed, Jun 26, 2024 at 7:29 PM Szehon Ho wrote: > Hi > > It was great to meet in person with Snowflake engineers and we had a good > discussion on the paths forward. > > Meeting notes for Snowflake- Iceberg sync. > >- Iceberg proposed Geometry type d

Re: Welcome Péter, Amogh and Eduard to the Apache Iceberg PMC

2024-08-13 Thread Szehon Ho
Congratulations all, very well deserved! Thanks Szehon On Tue, Aug 13, 2024 at 10:25 PM Russell Spitzer wrote: > Hi Y'all, > > It is my pleasure to let everyone know that the Iceberg PMC has voted to > have several talented individuals join us. > > So without further ado, please welcome Péter V

Re: [DISCUSS] adoption of format version 3

2024-08-06 Thread Szehon Ho
10:19 PM Micah Kornfield < >>>>> emkornfi...@gmail.com> wrote: >>>>> >>>>>> It sounds like most of the opinions so far are waiting for the scope >>>>>> of work to finish before finalizing the specification. >>>>>>

Re: [DISCUSS] adoption of format version 3

2024-07-31 Thread Szehon Ho
Sorry I missed the sync this morning (sick), I'd like to push for geo too. I think on this front as per the last sync, Ryan recommended to wait for Parquet support to land, to avoid having two versions on Iceberg side (Iceberg-native vs Parquet-native). Parquet support is being actively worked on

Re: [DISCUSS] Guidelines for committing PRs

2024-07-29 Thread Szehon Ho
t 1:53 PM Szehon Ho wrote: > Hi, > > Also if I read it correctly, I think this proposal imposes the following > workflows in "spec" folders : > >1. Large and functional changes. These redirect to Iceberg >improvement proposals, which ends in code-modi

Re: [DISCUSS] Guidelines for committing PRs

2024-07-29 Thread Szehon Ho
Hi, Also if I read it correctly, I think this proposal imposes the following workflows in "spec" folders : 1. Large and functional changes. These redirect to Iceberg improvement proposals, which ends in code-modification vote 2. bug-fixes or clarification, which is specified to require

Re: [VOTE] Drop Java 8 support in Iceberg 1.7.0

2024-07-26 Thread Szehon Ho
+1 (binding) Thanks Szehon On Fri, Jul 26, 2024 at 8:55 AM Steven Wu wrote: > +1 (binding) > > I would also suggest keeping the vote open for 7 days for a larger > decision like this. > > > On Fri, Jul 26, 2024 at 8:50 AM Ryan Blue > wrote: > >> +1 >> >> On Fri, Jul 26, 2024 at 8:42 AM Russell

Re: Dropping JDK 8 support

2024-07-22 Thread Szehon Ho
+1 for dropping JDK 8 in Iceberg 2.0. I also wonder the same thing as Huaxin (sorry if I missed a previous thread on Iceberg 2.0 plan). Also as Huaxin has discovered in Spark 4.0 Support PR , looks like we may have to drop Java8 first in Spark 4.0 mod

Re: Building with JDK 21

2024-07-22 Thread Szehon Ho
Thanks Piotr for driving this, late +1 to add JDK 21 support and your plan for spotless. It seems ok to me too to bite the bullet and move to newer spotless (disabling spotless for JDK8 builds) post 1.6, but looks like the discussion happened and I'm fine either way. Thanks! Szehon On Mon, Jul 2

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Szehon Ho
Hi Gabor I'm neutral for this, but can be convinced. My initial thought is that there would be no way to have ADD PARTITION (I assume old Hive workloads would rely on this), and these are not ANSI SQL standard statements as Spark moves in that direction. The second point of guaranteeing a metad

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-16 Thread Szehon Ho
lumns). >>> >>> How long are we going to keep the expired snapshot references by >>> default? If it is months/years, it can have major implications on the query >>> performance of metadata tables (like snapshots, all_*). >>> >>> I assume it will also have

Re: [VOTE] spec: remove the JSON spec for content file and file scan task sections

2024-07-11 Thread Szehon Ho
+1 Thanks Szehon On Thu, Jul 11, 2024 at 11:02 AM Daniel Weeks wrote: > +1 (binding) > > On Thu, Jul 11, 2024 at 10:54 AM Anurag Mantripragada > wrote: > >> +1 (non-binding) .Thanks Steve >> >> >> Anurag Mantripragada >> >> On Jul 11, 2024, at 10:27 AM, Yufei Gu wrote: >> >> +1 (binding) Than

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Szehon Ho
e work on supporting DELETE/UPDATE/MERGE in > the DataFrame API? > Thanks, > Wing Yew > > > On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho wrote: > >> Hi, >> >> Just FYI, good news, this change is merged on the Spark side : >> https://github.com/apache/spark/p

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Szehon Ho
Hi, Just FYI, good news, this change is merged on the Spark side: https://github.com/apache/spark/pull/46707 (it's the third effort!). In the next version of Spark, we will be able to pass read properties via SQL to a particular Iceberg table such as SELECT * FROM iceberg.db.table1 WITH (`locality`

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Szehon Ho
t even want to know >> of. If one can expire a snapshot from the middle of the history, that would >> be nice, so users would see only S1/S2/S4. The only downside is that >> reading S2 is less performant than reading S3, but IMHO this could be >> acceptable for having onl

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Szehon Ho
implementations. Also, the type >>> of metadata tracked can differ depending on the use case. For example, >>> while LakeChime retains partition and operation type metadata, it does not >>> track file-level metadata as there was no specific use case for that. >>&

[DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-05 Thread Szehon Ho
Hi folks, I would like to discuss an idea for an optional extension of Iceberg's Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread that this should be a fuller Iceberg format change. *Proposal Summary* Currently, ExpireSnapshots(long olderThan) purges metadata and delet

Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-03 Thread Szehon Ho
file removal without removing all the snapshot > information yet. > Please help me understand the reasoning behind these tradeoffs. > > Best > PF > > > > > On Thu, 4 Jul 2024 at 02:26, Szehon Ho <mailto:szehon.apa...@gmail.com>> wrote: >> Yes, I was ch

Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-03 Thread Szehon Ho
Yes, I was chatting with Yufei about this, and at first glance I agree this would be nice to have. I always thought that metadata tables are important enough to spec somewhere, and I think this is a nice place to do it. There seems to be some overlap with existing calls (i.e., you can get snapshots

Re: [Discuss] Geospatial Support

2024-06-26 Thread Szehon Ho
ored as a string, Iceberg cannot read it. This should be ok, as we only need this for XZ2 transform, where the user already passes in the info from CRS (up to user to make sure these align). Thanks Szehon On Tue, Jun 18, 2024 at 12:23 PM Szehon Ho wrote: > Jia and I will sync with t

Re: Feedback Collection: Bylaws in Iceberg

2024-06-24 Thread Szehon Ho
Hi Also copying my previous response in private. Hi > Thanks Jack for taking the time for this doc. While the Iceberg community > and PMC so far has been one of the most collaborative, and I have > personally the utmost respect for those that laid the groundwork without > which we would not be h

Re: Making the NDV property required for theta sketch blobs in Puffin

2024-06-21 Thread Szehon Ho
It makes sense to me, normally changing optional -> required would probably require a version bump, but maybe it is ok here as it is a relatively new format, afaik adopted by Trino which already sets this field, but let's see if anyone disagrees. Thanks Szehon On Fri, Jun 21, 2024 at 3:35 PM huax

Re: Agenda Community Sync 19th June

2024-06-18 Thread Szehon Ho
Hi guys, The sync is Juneteenth (US federal holiday), so I think some folks on this side may miss, FYI PS (at least from my side) one highlight is the longstanding 1k column bug is finally fixed (at least partially) in https://github.com/apache/iceberg/pull/10020 Thanks Szehon On Tue, Jun 18, 2

Re: [Discuss] Geospatial Support

2024-06-18 Thread Szehon Ho
ote: >> >>> > The min/max stats are discussed in the doc (Phase 2), depending on the >>> non-trivial encoding. >>> >>> Just want to add that min/max stats filtering could be supported by file >>> format natively. Adding geometry type to parquet spec >>

Re: [Discuss] Geospatial Support

2024-06-05 Thread Szehon Ho
not many libs >> can parse projjson. >> >> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo >> Iceberg? >> >> It is also worth noting that, although there are many libs that can parse >> SRID and perform look-up in the EPSG database,

Re: [Discuss] Geospatial Support

2024-05-29 Thread Szehon Ho
two > columns from different data providers. > > To address this we would like to propose including the option to specify > the SRS with only a SRID in phase 1. The query engine may choose to treat > it as opaque identified or make a look-up in the EPSG database of > supported. >

Re: [Discuss] Heap pressure with RewriteFiles APIs

2024-05-21 Thread Szehon Ho
Hi Naveen, Yes, it sounds like it will help to disable metrics for those columns. IIRC, by default manifest entries have metrics at the 'truncate(16)' level for 100 columns, which as you see can be quite memory intensive. A potential later improvement is also to have the ability to remove counts by
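
For illustration, disabling metrics for specific columns as suggested above is normally done with per-column table properties. The following is a minimal Python sketch that only builds the Spark SQL statement. The table name, column names, and helper function are hypothetical; the `write.metadata.metrics.column.<name>` property key is, to my understanding, the Iceberg table property for per-column metrics mode, and the setting would presumably apply to newly written files rather than existing manifest entries.

```python
def disable_metrics_sql(table, columns):
    """Build an ALTER TABLE statement setting the per-column metrics mode to 'none'.

    Hypothetical helper; it only formats a string and does not talk to Spark.
    """
    props = ", ".join(
        f"'write.metadata.metrics.column.{col}'='none'" for col in columns
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


# Made-up table and column names for the example.
stmt = disable_metrics_sql("db.events", ["payload", "raw_json"])
```

The resulting string could then be passed to a Spark session's `sql` method, or the same properties could be set through whichever catalog tooling is in use.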

Re: Materialized Views: Next Steps

2024-05-10 Thread Szehon Ho
apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45 > > Thanks, > Walaa. > > On Thu, May 9, 2024 at 11:30 PM Szehon Ho wrote: > >> Hi Walaa >> >> As there may be confus

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
ent/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 > [2] > https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE > > Thanks, > Walaa. > > > On Thu, May 9, 2024 at 5:49 PM Szehon Ho wro

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
t by now. If we agree, we can continue the > discussion on the PR, else, we can create a doc. > > Thanks, > Walaa. > > > On Thu, May 9, 2024 at 4:39 PM Szehon Ho wrote: > >> Thanks Walaa for driving it forward, looking forward to thinking about >> implementation

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
Thanks Walaa for driving it forward, looking forward to thinking about implementation of Materialized Views. I see Jan's point, the PR spec change is similar but does not seem to be completely aligned with the Draft Spec in the design doc: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6

[Discuss] Geospatial Support

2024-05-01 Thread Szehon Ho
Hi everyone, We have created a formal proposal for adding Geospatial support to Iceberg. Please read the following for details. - Github Proposal : https://github.com/apache/iceberg/issues/10260 - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt2

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-22 Thread Szehon Ho
+1 for the approach given it reduces the work. On this, as it exposes storage tables to user catalog, I was mainly thinking we should have a common suffix/naming pattern for storage table across catalog. The netflix approach sounds good to me. Hope we can continue the proposal, as there's still

Re: [VOTE] Release Apache Iceberg 1.5.1 RC0

2024-04-22 Thread Szehon Ho
+1 (binding) * Verify signature * Verify checksum * Verify licenses * Build and run basic test with Spark 3.5 Thanks Szehon On Sun, Apr 21, 2024 at 11:45 PM Ajantha Bhat wrote: > +1 (non-binding) > > * validated checksum and signature > * checked license docs & ran RAT checks > * ran build and

Re: Materialized view integration with REST spec

2024-03-22 Thread Szehon Ho
s back? > > On Fri, Mar 22, 2024 at 10:35 AM Szehon Ho > wrote: > >> Hi >> >> My understanding was last time it was still unresolved, and the action >> item was on Jack and/or Jan to make a shorter document. I think the >> debate now has boiled down to Ryan&

Re: Materialized view integration with REST spec

2024-03-22 Thread Szehon Ho
n 6: New MV spec with table and view metadata >>>>>>>>>>>>> >>>>>>>>>>>>> I originally excluded option 2 because I think it does not >>>>>>>>>>>>> align

Re: New committer: Renjie Liu

2024-03-11 Thread Szehon Ho
Congratulations! On Mon, Mar 11, 2024 at 12:43 PM Jack Ye wrote: > Congratulations Renjie! > > Best, > Jack Ye > > On Mon, Mar 11, 2024, 8:24 AM Ryan Blue wrote: > >> Congratulations, Renjie! Thanks for all your contributions! >> >> On Mon, Mar 11, 2024 at 12:52 AM Eduard Tudenhoefner >> wrote

Re: [VOTE] Release Apache Iceberg 1.5.0 RC6

2024-03-08 Thread Szehon Ho
+1 (binding) * Verified signature * Verified checksum * RAT check * built JDK 11 * Ran basic tests on Spark 3.5 Thanks Szehon On Fri, Mar 8, 2024 at 5:50 PM Amogh Jahagirdar wrote: > +1 non-binding > > Verified signatures,checksums,RAT checks, build, and tests with JDK11. I > also ran ad-hoc t

Re: New committer: Bryan Keller

2024-03-05 Thread Szehon Ho
Congratulations Bryan, well deserved, great work on Iceberg ! On Tue, Mar 5, 2024 at 8:14 AM Jack Ye wrote: > Congrats Bryan! > > -Jack > > On Tue, Mar 5, 2024 at 7:33 AM Amogh Jahagirdar wrote: > >> Congratulations Bryan! Very well deserved, thank you for all your >> contributions! >> >> On Tu

Re: [VOTE] Release Apache Iceberg 1.5.0 RC4

2024-03-01 Thread Szehon Ho
+1 (binding) - Verified signature - Verified checksum - RAT check - Compiled - Manually ran basic queries on Spark 3.5 On Fri, Mar 1, 2024 at 6:13 AM Fokko Driesprong wrote: > +1 (binding) > > - Checked checksum and signature > - Ran a modified version of dbt-spark to take advantage of the view

Re: Materialized view integration with REST spec

2024-02-29 Thread Szehon Ho
Hi Yes I mostly agree with the assessment. To clarify a few minor points. is a materialized view a view and a separate table, a combination of the > two (i.e. commits are combined), or a new metadata type? For 'new metadata type', I consider mostly Jack's initial proposal of a new Catalog MV o

Re: Materialized view integration with REST spec

2024-02-22 Thread Szehon Ho
o keep these separate from discussions about single points >>>> so that they can be persisted in the document. >>> >>> >>> Not sure if it is helpful, but I added voting chips Question 0, as maybe an >>> easier way to keep track of votes. If it is helpful

Re: Materialized view integration with REST spec

2024-02-21 Thread Szehon Ho
f we think >>> this format is not effective, I propose that we create a new mv channel in >>> Iceberg Slack workspace, and people interested can join and discuss all >>> these points directly. What do we think? >>> >>> Best, >>> Jack Ye >

Re: Materialized view integration with REST spec

2024-02-19 Thread Szehon Ho
Hi, Great to see more discussion on the MV spec. Actually, Jan's document "Iceberg Materialized View Spec" has been organized, with a "Design Questions" section to track these debates, and it would be nice to centr

Re: Spec change for multi-arg transform

2024-01-30 Thread Szehon Ho
Sorry I may have misunderstood the statement and maybe this is specific to multi-arg transform, in any case let's get a spec pr earlier in to discuss/specify behavior for V1-2 vs 3. Thanks Szehon On Tue, Jan 30, 2024 at 9:23 AM Szehon Ho wrote: > Thanks all for the discussion. >

Re: Spec change for multi-arg transform

2024-01-30 Thread Szehon Ho
ference is that >>>> for step 2, we typically just build one reference implementation in the >>>> Java library. We do vote on the large spec updates, but in this case you >>>> haven't seen one since we haven't built the reference implementation yet. >>&

Re: Spec change for multi-arg transform

2024-01-28 Thread Szehon Ho
Hi, This would not be retrofitting existing partition transforms, but just allowing for the creation of new multi-arg transforms. Is the concern that some implementations are never expecting new transforms to be added? Old implementations would indeed not be able to read Iceberg tables created w
