Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-08 Thread Zaicheng Wang
how to invoke these method (in a distributed manner or single thread manner, in async or sync). Take our use case as an example, we plan to have a new DDL syntax "create index id_1 on table col_1 using bloom"/"update

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-07 Thread Piotr Findeisen
ill invoke the index related method provided by iceberg. Storage): Does the index data have to be a file? Wondering if we want to design the index data storage interface in such way that people c

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-05 Thread Zaicheng Wang
I still remember some conclusions from previous discussions. 1). Index types support: We planned to support Skipping Index first.

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-02 Thread Zaicheng Wang
h reduces index reading overhead. Index file can be applied when generating the scan task. 2). As Ryan mentioned, Sequence number will be used to indicate whether an index is valid. Sequence

Re: Continuing the Secondary Index Discussion

2022-01-31 Thread Jack Ye
format which includes Column Name/ID, Index Type (String), Index content length, and binary content. It is not necessary to use Parquet to store index. Initial thought was 1 data file mapping to 1 index file. It can be merged to 1 partition

Re: Continuing the Secondary Index Discussion

2022-01-29 Thread Zaicheng Wang
the index reading and writing interface with Iceberg and leave the actual building logic as Engine specific (i.e., we can use different compute to build Index without changing anything inside Iceberg). Misc:

Re: Continuing the Secondary Index Discussion

2022-01-26 Thread melin li
4). How to build index: We want to keep the index reading and writing interface with Iceberg and leave the actual building logic as Engine specific (i.e., we can use different compute to build Index without changing anything inside Iceberg).

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Jack Ye
Index support API for DSv2 in Spark 3.x code base. Design doc: https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit PR should have been merged. Guy from IBM did a partial PoC and provided a private doc. I

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Zaicheng Wang
We can continue the discussion and breaking down the big tasks into tickets. Thanks! Miao From: Ryan Blue Date: Tuesday, January 25, 2022 at 5:08 PM To: Iceberg Dev List Subject: Re: Continuing the Secondary

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Miao Wang
did a partial PoC and provided a private doc. I will ask if he can make it public. We can continue the discussion and breaking down the big tasks into tickets. Thanks! Miao From: Ryan Blue Date: Tuesday, January 25, 2022 at 5:08 PM To: Iceberg Dev List Subject: Re: Continuing the Secondary

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Ryan Blue
Thanks for raising this for discussion, Jack! It would be great to start adding more indexes. > Scope of native index support The way I think about it, the biggest challenge here is how to know when you can use an index. For example, if you have a partition index that is up to date as of snapshot

Continuing the Secondary Index Discussion

2022-01-25 Thread Jack Ye
Hi everyone, Based on the conversation in the last community sync and the Iceberg Slack channel, it seems like multiple parties have interest in continuing the effort related to the secondary index in Iceberg, so I would like to restart the thread to continue the discussion. So far most people re