Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Szehon Ho
+1, it's an exciting step for Iceberg, and I look forward to all the new statistics and secondary indices it will allow. I had a few questions about what the reference to Puffin file(s) will be in the Iceberg spec, but that's orthogonal to the Puffin file format itself. Thanks, Szehon On Thu, Jun 9, 2022 at 3:32

Re: [VOTE] Release Apache Iceberg 0.14.0 RC1

2022-07-15 Thread Szehon Ho
+1 (non-binding) - Verified signature - Verified checksum - Rat check - Could not find Apache license headers on iceberg-build.properties ( as mentioned by Ryan) - Ran tests - Same error mentioned by John: org.apache.iceberg.aws.s3.TestS3FileIO > testPrefixDel

Re: [DISCUSS] Automatic Code Formatting / Code Style / Enforcing Code Style

2022-07-29 Thread Szehon Ho
Thanks for the auto formatting initiative, I think it's really a time saver. I also agree about the line length, it would be better to keep it at 120 and a bummer it has to be reduced to 100 now. Looking at palantir-format, I actually like some of their format choices like line-length and also not

Re: Welcome Fokko Driesprong as a committer!

2022-08-22 Thread Szehon Ho
Congratulations! Szehon On Mon, Aug 22, 2022 at 12:25 PM Péter Váry wrote: > Congratulations Fokko! > > On Mon, Aug 22, 2022, 16:37 Jahagirdar, Amogh > wrote: > >> Congratulations Fokko! >> >> >> >> *From: *Gabor Kaszab >> *Reply-To: *"dev@iceberg.apache.org" >> *Date: *Monday, August 22, 202

Re: Welcome Yufei Gu as a committer

2022-08-25 Thread Szehon Ho
Congratulations, Yufei! Thanks Szehon > On Aug 25, 2022, at 4:20 PM, Anton Okolnychyi > wrote: > > I’d like to welcome Yufei Gu as a committer to the project. > > Thanks for all your hard work, Yufei! > > - Anton

Re: [VOTE] Release Apache Iceberg 1.0.0 RC0

2022-10-10 Thread Szehon Ho
Hi, I get a NoClassDefFoundError from IcebergSparkExtensions when running Spark 3.3, with iceberg-spark-runtime-3.3_2.12-1.0.0.jar. I noticed this jar doesn't contain scala classes, unlike previous jars iceberg-spark-runtime-3.3_2.12-0.14.1.jar. scala> spark.sql("show databases").show java.lang.

Re: [VOTE] Release Apache Iceberg 1.0.0 RC0

2022-10-10 Thread Szehon Ho
:26 AM Szehon Ho wrote: > Hi, > > I get a NoClassDefFoundError from IcebergSparkExtensions when running > Spark 3.3, with iceberg-spark-runtime-3.3_2.12-1.0.0.jar. I noticed this > jar doesn't contain scala classes, unlike previous jars > iceberg-spark-runtime-3.3_2.1

Re: [DISCUSS] October board report

2022-10-12 Thread Szehon Ho
Turoczy, Bill Zhang) - Apache Iceberg's REST Catalog - A Gateway to Enriching Data Access via the Simplicity of an HTTP Service (Sam Redai) - Iceberg's Best Secret: Exploring Metadata Tables (Szehon Ho) - Integrated Audits: Streamlined Data Observability with Apache Iceberg (Sam

RemoveDanglingDeleteFile proposal

2022-11-04 Thread Szehon Ho
Hi all, I made a proposal about adding a Spark Procedure RemoveDanglingDeleteFiles. It would do a more comprehensive job of removing Delete Files that stay around after they become invalid (stop applying to Data Files), which happens in some cases, taking up storage and potentially affecting per

Re: [VOTE] Release Apache Iceberg 1.1.0 RC2

2022-11-17 Thread Szehon Ho
+1 (non-binding) 1. Verify signature 2. Verify checksum 3. License RAT check 4. Run unit test, Actually got a failure: org.apache.iceberg.spark.extensions.TestCopyOnWriteDelete > testDeleteWithSnapshotIsolation[catalogName = spark_catalog, implementation = org.apache.iceberg.spark.SparkSessionCa

Re: In Remembrance of Kyle

2022-12-06 Thread Szehon Ho
Very shocked when I first heard this over the weekend. Became more sad when I learned how long he was sick for, and so humbled that he chose to spend so much of his last days with us in the Iceberg community. I did not have a chance to work directly with him in Apple as I was on a different team.

Re: [DISCUSS] Action to Rewrite Equality Deletes as Position Deletes

2024-09-13 Thread Szehon Ho
+1, I'd be happy to see this feature. Thanks Szehon On Fri, Sep 13, 2024 at 10:33 AM Prashant Singh wrote: > Hi All, > > Starting this thread to revive the discussion on converting Equality > Deletes as Position deletes and see if this is something community wants > now (Happy to contribute in t

Re: Spec changes for deletion vectors

2024-10-15 Thread Szehon Ho
This is awesome work by Anton and Ryan. It looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient. It has been a long time coming, and I'm truly excited to see the great improvement in storage/perf. With respect to these fields, I think most of the concerns are already me

Re: Spec changes for deletion vectors

2024-10-21 Thread Szehon Ho
um" for each option. Because we want to >>> be able to seek directly to the DV for a particular data file, I think it's >>> important to start the blob with magic bytes. That way the reader can >>> validate that the offset was correct and that the contents of the
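The point quoted above (start each blob with magic bytes so that a reader seeking directly to an offset can confirm it landed in the right place) can be illustrated with a minimal sketch. The marker value `MAGIC`, the function name `read_blob_at`, and the blob layout here are hypothetical and chosen only for illustration; they are not taken from the Iceberg or Puffin specifications.

```python
import io

# Hypothetical 4-byte marker for illustration only; not the value from any spec.
MAGIC = b"DVEX"


def read_blob_at(stream, offset, length):
    """Seek to `offset`, read `length` bytes, and check the leading magic bytes.

    Returns the payload that follows the marker. Raises ValueError when the
    blob is truncated or the marker is missing, which suggests a bad offset.
    """
    stream.seek(offset)
    blob = stream.read(length)
    if len(blob) != length:
        raise ValueError("blob is truncated")
    if not blob.startswith(MAGIC):
        raise ValueError("magic bytes not found; the offset is probably wrong")
    return blob[len(MAGIC):]


# A toy "file": 4 bytes of unrelated data, then one blob.
data = b"junk" + MAGIC + b"payload"
payload = read_blob_at(io.BytesIO(data), offset=4, length=len(MAGIC) + 7)
```

Without the leading marker, a reader given a wrong offset would have no cheap way to tell that it is decoding unrelated bytes.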

Re: Spec changes for deletion vectors

2024-10-17 Thread Szehon Ho
rds compatibility by adding "reader" support for Delta Lake >>>>> DVs >>>>> in the spec, but not "writer support". >>>>> >>> d. Go forward with the current proposal but use offset and length >>>>

Re: [PROPOSAL] Add manifest-level statistics for CBO estimation

2024-10-24 Thread Szehon Ho
Hi, I'm just wondering: would one solution be to put these stats in Puffin files? There's already ComputeTableStatsSparkAction (and probably similar actions in other engines), and I can imagine a quick metadata aggregation job to compute min/max/null_values, etc. Also how accurate would we need the stats? T

Re: [DISCUSS] Partial Metadata Loading

2024-11-05 Thread Szehon Ho
There seem to be many opinions here, but one of the main objections seems to be the complexity added to the REST spec impeding newer catalogs. Looking through the actual REST API change proposal

Re: [VOTE] Deletion Vectors in V3

2024-10-30 Thread Szehon Ho
-0 Great work and exciting functionality, but transferring concerns from the other thread about the decision. Thanks Szehon On Wed, Oct 30, 2024 at 9:12 AM Steven Wu wrote: > +1 > > On Wed, Oct 30, 2024 at 1:07 AM xianjin wrote: > >> +1 (non binding) >> >> On Wed, Oct 30, 2024 at 2:28 PM Jean

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Szehon Ho
Hi Pucheng There were some parts in the implementation where column field ids collided with partition field ids. https://github.com/apache/iceberg/pull/10020 introduced mechanisms for affected code to get unique ids, and known places have been fixed. (Particularly the Spark procedure rewrite_posit

Re: Need clarification on max number of columns a Iceberg table can have

2024-10-31 Thread Szehon Ho
le issues but they are generally fixable. > > Is my understanding correct? > > Thanks again for your quick response. > > On Thu, Oct 31, 2024 at 5:50 PM Szehon Ho wrote: > >> Hi Pucheng >> >> There were some parts in the implementation where column field ids

Re: [Discuss] Geospatial Support

2024-09-30 Thread Szehon Ho
and team are also volunteering to work on the prototype immediately afterwards. Thank you, Szehon On Tue, Aug 20, 2024 at 1:57 PM Szehon Ho wrote: > Hi all > > Please take a look at the proposed spec change to support Geo type for V3 > in : https://github.com/apache/iceberg/pul

Re: [Discuss] Iceberg View Interoperability

2024-10-25 Thread Szehon Ho
he limitations > with the SQL representation (like not wanting to rewrite after resolution). > > On Fri, Oct 25, 2024 at 10:21 AM Szehon Ho > wrote: > >> I don't have hands-on experience with Substrait, but wondering, is >> substrait representation possible today with existi

Re: [Discuss] Iceberg View Interoperability

2024-10-25 Thread Szehon Ho
I don't have hands-on experience with Substrait, but I'm wondering: is a Substrait representation possible today with the existing Iceberg view spec? I.e., can engines today store the text-serialized Substrait representation with SQL dialect 'substrait'? Or is it an abuse of spec and we should make a proper f

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
>> because `HiveOperationsBase.validateTableIsIceberg` throws a >> `NoSuchTableException`. >> This would cause the code above to attempt to create the table, only to >> fail since the name already exists in the HMS. >> If `tableExists` is meant to check for conflicti

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho
> existing behavior for this part >>>>> >>>>> +1. I realized that this is not a new behavior. The `loadTable` >>>>> implementation has this problem too. >>>>> It would be good to have a test case specifically for this edge case >>&

Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-11-27 Thread Szehon Ho
This makes sense to me generally, I've tried a few times to search in the spec to find a list of possible snapshot summary properties, and was a bit surprised to not find them there. So I think this would be a nice addition. I'm curious if there's any historical reason it's not been included in t

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho
> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho > wrote: > >> Hm I think the thread got a bit sidetracked by the other question. >> >> The initial proposal by Steve is a performance improvement for >> HiveCatalog's tableExists(). Currently it loads both Hive a

Welcome Huaxin Gao as a committer!

2025-02-06 Thread Szehon Ho
Hi everyone, The Project Management Committee (PMC) for Apache Iceberg has invited Huaxin Gao to become a committer, and I am happy to announce that she has accepted. Huaxin has done a lot of impressive work in areas such as Iceberg-Spark integration and recently Iceberg-Comet integrations. Thank

Re: [DISCUSS] Simplify multi-arg table metadata

2025-02-08 Thread Szehon Ho
Missed the thread, but with v3 now much closer than before, agree that the gain is not worth the risk. Thanks! Szehon On Fri, Feb 7, 2025 at 11:52 PM Xianjin Ye wrote: > +1. I think it's good timing to allow multi-arg transform for V3 and > onwards only. > > On 2025/02/03 18:26:00 "Driesprong,

[VOTE] Add Geometry and Geography types for V3

2025-02-06 Thread Szehon Ho
Hi everyone We would like to add Geometry and Geography types to the Iceberg V3 spec: https://github.com/apache/iceberg/pull/10981 This is proposed together with Apache Parquet format change to support geospatial data. https://github.com/apache/parquet-format/pull/240 This vote will be open fo

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-10 Thread Szehon Ho
>>>>> +1 >>>>>> >>>>>> Best regards, >>>>>> Honah >>>>>> >>>>>> On Fri, Feb 7, 2025 at 10:45 AM Aihua Xu wrote: >>>>>> >>>>>>> +1 (non-bi

Re: [VOTE] Add Geometry and Geography types for V3

2025-02-10 Thread Szehon Ho
at 11:42 AM Szehon Ho wrote: > Here is my +1 (binding) > > Thanks > Szehon > > On Mon, Feb 10, 2025 at 12:47 AM Eduard Tudenhöfner < > etudenhoef...@apache.org> wrote: > >> +1 >> >> On Sat, Feb 8, 2025 at 1:02 PM Fokko Driesprong wrote: >>

Re: New committer: Matt Topol

2024-12-10 Thread Szehon Ho
Congratulations, Matt! Thanks for the perseverance on Iceberg-Go Szehon On Tue, Dec 10, 2024 at 10:02 AM Bryan Keller wrote: > Congrats! > > -Bryan > > On Dec 10, 2024, at 7:37 AM, Matt Topol wrote: > > Thanks everyone! > > On Tue, Dec 10, 2024 at 9:26 AM Gang Wu wrote: > >> Congrats Matt! >>

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
Should add, my personal preference is probably not to change the existing behavior for this part (false, if exists a Hive table with same name) at the moment, just adding another possibility for consideration. Thanks Szehon On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho wrote: > Thanks Kevin

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-21 Thread Szehon Ho
Hi, It's a good performance find and improvement. I left some comments on the PR. IMO, the behavior actually matches the API javadoc ("Check whether table exists") more closely, not whether it is corrupted or not, so I'm supportive of it. Thanks Szehon On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang wrote

Re: [DISCUSS] Deprecate embedded manifests

2024-11-21 Thread Szehon Ho
+1, great to have less possible paths. Thanks Szehon On Thu, Nov 21, 2024 at 10:33 AM Steve Zhang wrote: > +1 to deprecate > > Thanks, > Steve Zhang > > > > On Nov 19, 2024, at 3:32 AM, Fokko Driesprong wrote: > > Hi everyone, > > I would like to propose to deprecate embedded manifests >

Re: [Discuss] Geospatial Support

2024-12-06 Thread Szehon Ho
itzer < >> russell.spit...@gmail.com> wrote: >> >>> All my concerns are addressed, I'm ready to vote. >>> >>> On Mon, Sep 30, 2024 at 1:21 PM Szehon Ho >>> wrote: >>> >>>> Hi all, >>>> >>>> There have

Re: [VOTE] Document Snapshot Summary Optional Fields as Subsection of Appendix F in Spec

2025-01-21 Thread Szehon Ho
+1 (binding) Thanks Szehon On Tue, Jan 21, 2025 at 12:55 PM Yufei Gu wrote: > +1 Thanks Honah! > > Yufei > > > On Tue, Jan 21, 2025 at 12:45 PM Russell Spitzer < > russell.spit...@gmail.com> wrote: > >> +1 >> >> On Tue, Jan 21, 2025 at 2:36 PM rdb...@gmail.com >> wrote: >> >>> +1 >>> >>> On Tu

Re: [VOTE] Add overwriteRequested to RegisterTableRequest in REST spec

2025-02-13 Thread Szehon Ho
+1 Thanks Steve! Szehon On Thu, Feb 13, 2025 at 1:23 PM Yufei Gu wrote: > +1 (binding) > Yufei > > > On Thu, Feb 13, 2025 at 1:20 PM huaxin gao wrote: > >> +1 (non-binding) >> >> On Thu, Feb 13, 2025 at 11:51 AM Anurag Mantripragada >> wrote: >> >>> +1 (non-binding) >>> >>> Thanks, Steve! >>>

Re: [VOTE] Simplify multi-arg table metadata

2025-02-09 Thread Szehon Ho
+1 (binding) Thanks Fokko! Szehon > On Feb 9, 2025, at 8:14 AM, Jean-Baptiste Onofré wrote: > > +1 (non binding) > > Thanks to the cat :) > > Regards > JB > >> On Sun, Feb 9, 2025 at 10:01 AM Fokko Driesprong wrote: >> >> (Second attempt, the cat ran over the keyboard) >> >> Hey everyone

Re: [VOTE] Java implementation notes around current-snapshot-id

2025-02-24 Thread Szehon Ho
+1 Thanks Szehon On Mon, Feb 24, 2025 at 2:52 PM rdb...@gmail.com wrote: > +1 > > On Mon, Feb 24, 2025 at 12:26 PM Daniel Weeks wrote: > >> +1 >> >> On Mon, Feb 24, 2025, 11:00 AM Russell Spitzer >> wrote: >> >>> +1 >>> >>> On Mon, Feb 24, 2025 at 12:55 PM Fokko Driesprong >>> wrote: >>> >>>

Re: Spark: Copy Table Action

2025-02-20 Thread Szehon Ho
Hi, Thanks to Steve Zhang, we now have a doc on how to use RewriteTablePaths as part of table replication (hot off the nightly doc build): https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication. You can use it like this: - RegisterTable , returns CopyPlan and lastVersionFileN

Re: Restrict orphan file removal to data/metadata directories

2025-03-05 Thread Szehon Ho
Hi Karuppayya, I wanted to check: would a regex suffice for this use case (i.e., match /data/*, /metadata/*) and keep it more general? The idea came from Dan in a one-off chat. Thanks Szehon On Wed, Feb 26, 2025 at 1:41 PM Pucheng Yang wrote: > Yes, Iceberg spec does not define where the data
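The regex idea mentioned above can be sketched roughly as follows. This is only an illustration of path filtering in general: the helper names, the example table location, and the exact pattern are assumptions, not an option that exists in Iceberg's orphan file removal.

```python
import re


def build_candidate_pattern(table_location):
    """Build a pattern matching only files under <location>/data/ or <location>/metadata/.

    Hypothetical helper for illustration; the table location is escaped so it
    is matched literally.
    """
    base = re.escape(table_location.rstrip("/"))
    return re.compile(rf"^{base}/(data|metadata)/.+$")


def filter_candidates(paths, table_location):
    """Keep only the paths that fall under the data or metadata directories."""
    pattern = build_candidate_pattern(table_location)
    return [p for p in paths if pattern.match(p)]


# Example listing with an assumed table location of s3://bucket/tbl.
listing = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/metadata/v1.json",
    "s3://bucket/tbl/other/x.bin",
    "s3://bucket/tbl2/data/b.parquet",
]
candidates = filter_candidates(listing, "s3://bucket/tbl")
```

A pattern keeps the restriction general: a caller with a different directory layout could supply a different expression instead of relying on hard-coded directory names.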

[VOTE] Minor simplifications for Geo Spec

2025-03-18 Thread Szehon Ho
Hi everyone, While working on the reference implementation for Geometry/Geography spec, we noticed some parts that can be simplified for this first version: 1. Default values should always be null (requires WKT serialization logic, for not many real world use cases) 2. JSON type serializ

Re: [VOTE] Minor simplifications for Geo Spec

2025-03-23 Thread Szehon Ho
) Therefore, the vote passes. Szehon On Sun, Mar 23, 2025 at 5:47 PM Szehon Ho wrote: > +1 > > On Sat, Mar 22, 2025 at 10:42 PM huaxin gao > wrote: > >> +1 (non-binding) >> >> On Sat, Mar 22, 2025 at 6:32 PM Prashant Singh >> wrote: >> >>> +1 (non b

Re: [VOTE] Minor simplifications for Geo Spec

2025-03-23 Thread Szehon Ho
+1 On Sat, Mar 22, 2025 at 10:42 PM huaxin gao wrote: > +1 (non-binding) > > On Sat, Mar 22, 2025 at 6:32 PM Prashant Singh > wrote: > >> +1 (non binding) >> >> Best, >> Prashant >> >> On Fri, Mar 21, 2025 at 10:03 AM Russell Spitzer < >> russell.spit...@gmail.com> wrote: >> >>> +1 (bind >>> >>
