Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-08 Thread Zaicheng Wang
how to invoke these methods (in a distributed manner or single-threaded manner, in async or sync). Take our use case as an example, we plan to have a new DDL syntax "create index id_1 on table col_1 using bloom"/"update

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-07 Thread Piotr Findeisen
ill invoke the index-related method provided by Iceberg. Storage): Does the index data have to be a file? Wondering if we want to design the index data storage interface in such a way that people c

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-05 Thread Zaicheng Wang
I still remember some conclusions from previous discussions. 1). Index types support: We planned to support Skipping Index first.

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-02 Thread Zaicheng Wang
h reduces index reading overhead. Index file can be applied when generating the scan task. 2). As Ryan mentioned, Sequence number will be used to indicate whether an index is valid. Sequence

Re: Continuing the Secondary Index Discussion

2022-01-31 Thread Jack Ye
format which includes Column Name/ID, Index Type (String), Index content length, and binary content. It is not necessary to use Parquet to store index. Initial thought was 1 data file mapping to 1 index file. It can be merged to 1 partition
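
The record layout named in the snippet above (column ID, index type string, content length, binary content) could be sketched as follows. This is only an illustration, not the format the thread settled on: the field widths, the little-endian byte order, and the names `IndexEntry`, `encode_entry`, and `decode_entry` are all assumptions made for the example.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed header: 4-byte signed column ID, 2-byte length of the type string.
_HEADER = "<iH"
_HEADER_SIZE = struct.calcsize(_HEADER)


@dataclass
class IndexEntry:
    column_id: int      # assumed: column referenced by ID rather than by name
    index_type: str     # e.g. "bloom"; stored as UTF-8
    content: bytes      # opaque, index-type-specific payload


def encode_entry(entry: IndexEntry) -> bytes:
    """Serialize one entry as: column ID, type length, type, content length, content."""
    type_bytes = entry.index_type.encode("utf-8")
    header = struct.pack(_HEADER, entry.column_id, len(type_bytes))
    length = struct.pack("<I", len(entry.content))
    return header + type_bytes + length + entry.content


def decode_entry(buf: bytes, offset: int = 0):
    """Read one entry starting at offset; return (entry, offset of the next entry)."""
    column_id, type_len = struct.unpack_from(_HEADER, buf, offset)
    offset += _HEADER_SIZE
    index_type = buf[offset:offset + type_len].decode("utf-8")
    offset += type_len
    (content_len,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    content = bytes(buf[offset:offset + content_len])
    return IndexEntry(column_id, index_type, content), offset + content_len


if __name__ == "__main__":
    # Two entries written back to back, as one index file might hold several columns.
    blob = encode_entry(IndexEntry(1, "bloom", b"\x01\x02\x03")) + \
           encode_entry(IndexEntry(2, "minmax", b""))
    first, pos = decode_entry(blob)
    second, _ = decode_entry(blob, pos)
    print(first, second)
```

Storing the content length ahead of the payload lets a reader skip index types it does not understand, which fits the point that the payload need not be Parquet.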

Re: Continuing the Secondary Index Discussion

2022-01-29 Thread Zaicheng Wang
the index reading and writing interface with Iceberg and leave the actual building logic as Engine specific (i.e., we can use different compute to build Index without changing anything inside Iceberg). Misc:

Re: Continuing the Secondary Index Discussion

2022-01-26 Thread melin li
4). How to build index: We want to keep the index reading and writing interface with Iceberg and leave the actual building logic as Engine specific (i.e., we can use different compute to build Index without changing anything inside Iceberg).

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Jack Ye
Index support API for DSv2 in Spark 3.x code base. Design doc: https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit PR should have been merged. Guy from IBM did a partial PoC and provided a private doc. I

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Zaicheng Wang
We can continue the discussion and break down the big tasks into tickets. Thanks! Miao From: Ryan Blue Date: Tuesday, January 25, 2022 at 5:08 PM To: Iceberg Dev List Subject: Re: Continuing the Secondary

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Miao Wang
did a partial PoC and provided a private doc. I will ask if he can make it public. We can continue the discussion and break down the big tasks into tickets. Thanks! Miao From: Ryan Blue Date: Tuesday, January 25, 2022 at 5:08 PM To: Iceberg Dev List Subject: Re: Continuing the Secondary

Re: Continuing the Secondary Index Discussion

2022-01-25 Thread Ryan Blue
Thanks for raising this for discussion, Jack! It would be great to start adding more indexes. > Scope of native index support The way I think about it, the biggest challenge here is how to know when you can use an index. For example, if you have a partition index that is up to date as of snapshot