Hi PF, Sure, I've rescheduled the meeting to a CET-friendly time. The meeting is now scheduled for 9 AM PST, March 11th (6 PM CET, March 11th). The meeting link is meet.google.com/ttd-jzid-abp Please feel free to Slack me or tag me in the Slack channel if you would like a meeting invitation (or you can join the meeting directly).
Best, Zaicheng Piotr Findeisen <pi...@starburstdata.com> wrote on Mon, Mar 7, 2022 at 21:54: > Hi Zaicheng, > > thanks for following up on this. I'm certainly interested. > The proposed time doesn't work for me though, I'm in the CET time zone. > > Best, > PF > > > On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang <wcatp19891...@gmail.com> > wrote: > >> Hi dev folks, >> >> As discussed in the sync >> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1> >> meeting, we will have a dedicated meeting on this topic. >> I tentatively scheduled a meeting for 4 PM, March 8th, PST. The meeting >> link is https://meet.google.com/ttd-jzid-abp >> Please let me know if the time does not work for you. >> >> Thanks, >> Zaicheng >> >> zaicheng wang <wangzaich...@bytedance.com> wrote on Wed, Mar 2, 2022 at 21:17: >> >>> Hi folks, >>> >>> This is Zaicheng from ByteDance. We spent some time working on solving >>> the index invalidation problem we discussed in the dev email channel. >>> While working on the POC, we also realized that some >>> metadata changes might need to be introduced. >>> We put these details into a document: >>> >>> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing >>> The document includes two proposals for solving the index invalidation >>> problem: one based on @Jack Ye’s idea of introducing a new sequence number, >>> and another that leverages the current manifest entry structure. The >>> document also describes the corresponding table spec change. >>> Please let me know if you have any thoughts. We could also discuss this >>> during the sync meeting. >>> >>> Thanks, >>> Zaicheng >>> >>> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <yezhao...@gmail.com> wrote: >>> >>>> Hi Zaicheng, I cannot see your pictures; maybe we could discuss in >>>> Slack. 
>>>> >>>> The goal here is to have a monotonically increasing number that could >>>> be used to detect which files have been newly added and should be indexed. >>>> This is especially important for knowing how up-to-date an index is for each >>>> partition. >>>> >>>> In a table without compaction, the sequence number of files would continue >>>> to increase. If we have indexed all files up to sequence number 3, we know >>>> that the next indexing process needs to index all the files with sequence >>>> number greater than 3. But during compaction, files will be rewritten with >>>> the starting sequence number. By commit time the sequence number might >>>> already have gone much higher. For example, I start compaction at seq=3, and >>>> while this runs for a few hours, there are 10 inserts done to the >>>> table, and the current sequence number is 13. When I commit the compacted >>>> data files, those files are essentially written at a sequence number older >>>> than the latest. This breaks a lot of assumptions, like: (1) I cannot >>>> find new data to index just by checking whether the sequence number is higher than >>>> a certain value; (2) a reader cannot determine if an index can be used >>>> based on the sequence number. >>>> >>>> The solution I was describing is to have another watermark that is >>>> monotonically increasing regardless of compaction. So compaction >>>> would commit those files at seq=3, but the new watermark of those files is >>>> 14. Then we can use this new watermark for all the index operations. >>>> >>>> Best, >>>> Jack Ye >>>> >>>> >>>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wcatp19891...@gmail.com> >>>> wrote: >>>> >>>>> Hi Jack, >>>>> >>>>> >>>>> Thanks for the summary; it helps me a lot. >>>>> >>>>> Trying to understand point 2, and adding my 2 cents. >>>>> >>>>> *a mechanism for tracking file change is needed. 
Unfortunately >>>>> sequence numbers cannot be used due to the introduction of compaction that >>>>> rewrites files into a lower sequence number. Another monotonically >>>>> increasing watermark for files has to be introduced for index change >>>>> detection and invalidation.* >>>>> >>>>> Please let me know if I have any wrong/silly assumptions. >>>>> >>>>> So the *reason* we couldn't use sequence numbers as the validity >>>>> indicator of the index is compaction. Before compaction (taking a very >>>>> simple example), the data file and index file have a mapping, and >>>>> tableScan.planTask() is able to decide whether to use the index purely by >>>>> comparing sequence numbers (as well as the index spec id, if we have one). >>>>> >>>>> After compaction, tableScan.planTask() can no longer do so, because data >>>>> file 5 has been compacted into a new data file with seq = 10. Thus wrong plan tasks >>>>> might be returned. >>>>> >>>>> I wonder how an additional watermark only for the index could solve >>>>> the problem? >>>>> >>>>> >>>>> And based on my gut feeling, I feel we could somehow solve the problem >>>>> with the current sequence number: >>>>> >>>>> *Option 1*: When compacting, we could compact the data files whose >>>>> index is up to date into one group, and the files whose index is stale or missing >>>>> into another group (just like what we do with data files that are >>>>> unpartitioned or whose partition spec id does not match). >>>>> >>>>> The *pro* is that we could still leverage indexes for part of the >>>>> data files, and we could reuse the sequence number. >>>>> >>>>> The *cons* are that the compaction might not reach the target size >>>>> and we might still have small files. >>>>> >>>>> *Option 2*: >>>>> >>>>> Assume compaction is usually triggered by data engineers and the >>>>> compaction action is not frequent. We could simply invalidate all index >>>>> files for the compacted data files. 
And the user needs to rebuild the index every >>>>> time after compaction. >>>>> >>>>> *Pro*: Easy to implement, clear to understand. >>>>> >>>>> *Cons*: Relatively bad user experience. Wastes some computing >>>>> resources redoing work. >>>>> >>>>> *Option 3*: >>>>> >>>>> We could leverage the engine's computing resources to always rebuild >>>>> indexes during data compaction. >>>>> >>>>> *Pro*: Users could leverage the index after the data compaction. >>>>> >>>>> *Cons*: Rebuilding might take more time/resources. >>>>> >>>>> *Option 3 alternative*: add a configuration property to compaction, >>>>> controlling whether the user wants to rebuild the index during compaction. >>>>> >>>>> >>>>> Please let me know if you have any thoughts on this. >>>>> >>>>> Best, >>>>> >>>>> Zaicheng >>>>> >>>>> Jack Ye <yezhao...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17: >>>>> >>>>>> Thanks for the fast responses! >>>>>> >>>>>> Based on the conversations above, it sounds like we have the >>>>>> following consensus: >>>>>> >>>>>> 1. asynchronous index creation is preferred, although synchronous >>>>>> index creation is possible. >>>>>> 2. a mechanism for tracking file change is needed. Unfortunately >>>>>> sequence numbers cannot be used due to the introduction of compaction that >>>>>> rewrites files into a lower sequence number. Another monotonically >>>>>> increasing watermark for files has to be introduced for index change >>>>>> detection and invalidation. >>>>>> 3. index creation and maintenance procedures should be pluggable by >>>>>> different engines. This should not be an issue because Iceberg has been >>>>>> designing action interfaces for different table maintenance procedures so >>>>>> far, so what Zaicheng describes should be the natural development direction >>>>>> once the work is started. >>>>>> >>>>>> Regarding index level, I also think partition-level index is more >>>>>> important, but it seems like we have to first do file level as the foundation. 
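[Editor's note] The watermark mechanism Jack describes above can be sketched in a few lines of Python. This is an illustrative model only: `watermark` is the hypothetical per-file, monotonically increasing commit counter proposed in the thread; Iceberg's spec has no such field today.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int  # Iceberg sequence number; compaction keeps the old one
    watermark: int        # hypothetical monotonically increasing commit counter

def files_to_index(files, indexed_up_to):
    """Return files not yet covered by the index, using the watermark
    rather than the sequence number, so compacted files are not missed."""
    return [f for f in files if f.watermark > indexed_up_to]

# Jack's example: the index is current up to watermark 13; compaction commits
# rewritten files at their old sequence number (3) but at watermark 14.
files = [
    DataFile("a.parquet", sequence_number=12, watermark=12),
    DataFile("compacted.parquet", sequence_number=3, watermark=14),
]

# A sequence-number check would miss the compacted file entirely...
assert [f.path for f in files if f.sequence_number > 13] == []
# ...while the watermark-based check picks it up.
assert [f.path for f in files_to_index(files, indexed_up_to=13)] == ["compacted.parquet"]
```

The same watermark comparison would also serve readers deciding whether an index still covers a file, which is the invalidation half of Jack's point.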
This leads to the index storage part. I am not talking about >>>>>> using Parquet to store it; I am asking about what Miao is describing. I >>>>>> don't think we have a consensus around the exact place to store index >>>>>> information yet. My memory is that there are 3 ways: >>>>>> 1. file-level index stored as a binary field in the manifest, partition-level >>>>>> index stored as a binary field in the manifest list. This would only work >>>>>> for small indexes like bitmaps (or bloom filters, to a certain extent) >>>>>> 2. some sort of binary file to store index data, with the index metadata >>>>>> (e.g. index type) and a pointer to the binary index data file kept as in 1 (I >>>>>> think this is what Miao is describing) >>>>>> 3. some sort of index spec to independently store index metadata and >>>>>> data, similar to what we are proposing today for views >>>>>> >>>>>> Another aspect of index storage is the index file location in cases >>>>>> 2 and 3. In the original doc a specific file path structure is proposed, >>>>>> whereas this goes a bit against the Iceberg standard of not assuming file >>>>>> paths, so as to work with any storage. We also need more clarity on that topic. >>>>>> >>>>>> Best, >>>>>> Jack Ye >>>>>> >>>>>> >>>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang < >>>>>> wcatp19891...@gmail.com> wrote: >>>>>> >>>>>>> Thanks for having the thread. This is Zaicheng from ByteDance. >>>>>>> >>>>>>> Initially we were planning to add an index feature to our internal >>>>>>> Trino, and we feel Iceberg could be the best place for holding/building the >>>>>>> index data. >>>>>>> We are very interested in having and contributing to this feature. 
>>>>>>> (Pretty new to the community, so just my 2 cents.) >>>>>>> >>>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide >>>>>>> interfaces for creating/updating/deleting indexes, and each engine can decide >>>>>>> how to invoke these methods (in a distributed or single-threaded manner, >>>>>>> async or sync). >>>>>>> Take our use case as an example: we plan to have a new DDL syntax, >>>>>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on table >>>>>>> col_1", and our SQL engine will create distributed index creation/updating >>>>>>> operators. Each operator will invoke the index-related methods provided by >>>>>>> Iceberg. >>>>>>> >>>>>>> Storage): Does the index data have to be a file? Wondering if we >>>>>>> want to design the index data storage interface in such a way that people can >>>>>>> plug in different index storage (file storage/centralized index storage >>>>>>> service) later on. >>>>>>> >>>>>>> Thanks, >>>>>>> Zaicheng >>>>>>> >>>>>>> >>>>>>> Miao Wang <miw...@adobe.com.invalid> wrote on Wed, Jan 26, 2022 at 10:22: >>>>>>> >>>>>>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance >>>>>>>> created a Slack channel for the index work. I suggested he add Anton and >>>>>>>> you to the channel. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I still remember some conclusions from previous discussions. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 1). Index types supported: We planned to support skipping indexes >>>>>>>> first. Iceberg metadata exposes hints about whether the tracked data files have >>>>>>>> an index, which reduces index reading overhead. The index file can be applied when >>>>>>>> generating the scan task. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 2). As Ryan mentioned, the sequence number will be used to indicate >>>>>>>> whether an index is valid. The sequence number can link the data evolution with >>>>>>>> index evolution. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 3). 
Storage: We planned to have a simple file format which includes >>>>>>>> Column Name/ID, Index Type (String), index content length, and binary >>>>>>>> content. It is not necessary to use Parquet to store the index. The initial thought >>>>>>>> was 1 data file mapping to 1 index file. It can be merged into 1 partition >>>>>>>> mapping to 1 index file. As Ryan said, a file-level implementation could be a >>>>>>>> stepping stone for a partition-level implementation. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 4). How to build the index: We want to keep the index reading and >>>>>>>> writing interfaces within Iceberg and leave the actual building logic as >>>>>>>> engine-specific (i.e., we can use different compute to build the index without >>>>>>>> changing anything inside Iceberg). >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Misc: >>>>>>>> >>>>>>>> Huaxin implemented index support APIs for DSv2 in the Spark 3.x code >>>>>>>> base. >>>>>>>> >>>>>>>> Design doc: >>>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit >>>>>>>> >>>>>>>> The PR should have been merged. >>>>>>>> >>>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will >>>>>>>> ask if he can make it public. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> We can continue the discussion and break the big tasks down into >>>>>>>> tickets. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Miao >>>>>>>> >>>>>>>> *From: *Ryan Blue <b...@tabular.io> >>>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM >>>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org> >>>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion >>>>>>>> >>>>>>>> Thanks for raising this for discussion, Jack! It would be great to >>>>>>>> start adding more indexes. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> > Scope of native index support >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The way I think about it, the biggest challenge here is how to know >>>>>>>> when you can use an index. 
For example, if you have a partition index >>>>>>>> that >>>>>>>> is up to date as of snapshot 13764091836784, but the current snapshot >>>>>>>> is >>>>>>>> 97613097151667, then you basically have no idea what files are covered >>>>>>>> or >>>>>>>> not and can't use it. On the other hand, if you know that the index >>>>>>>> was up >>>>>>>> to date as of sequence number 11 and you're reading sequence number 12, >>>>>>>> then you just have to read any data file that was written at sequence >>>>>>>> number 12. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The problem of where you can use an index makes me think that it is >>>>>>>> best to maintain index metadata within Iceberg. An alternative is to >>>>>>>> try to >>>>>>>> always keep the index up-to-date, but I don't think that's necessarily >>>>>>>> possible -- you'd have to support index updates in every writer that >>>>>>>> touches table data. You would have to spend the time updating indexes >>>>>>>> at >>>>>>>> write time, but there are competing priorities like making data >>>>>>>> available. >>>>>>>> So I think you want asynchronous index updates and that leads to >>>>>>>> integration with the table format. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> > Index levels >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I think that partition-level indexes are better for job planning >>>>>>>> (eliminate whole partitions!) but file-level are still useful for >>>>>>>> skipping >>>>>>>> files at the task level. I would probably focus on partition-level, >>>>>>>> but I'm >>>>>>>> not strongly opinionated here. File-level is probably a stepping stone >>>>>>>> to >>>>>>>> partition-level, given that we would be able to track index data in the >>>>>>>> same format. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> > Index storage >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Do you mean putting indexes in Parquet, or using Parquet for >>>>>>>> indexes? 
I think that bloom filters would probably exceed the amount of >>>>>>>> data we'd want to put into a Parquet binary column, probably at the >>>>>>>> file >>>>>>>> level and almost certainly at the partition level, since the size >>>>>>>> depends >>>>>>>> on the number of distinct values and the primary use is for >>>>>>>> identifiers. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> > Indexing process >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Synchronous is nice, but as I said above, I think we have to >>>>>>>> support async because it is too complicated to update every writer that >>>>>>>> touches a table and you may not want to pay the price at write time. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> > Index validation >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I think this is pretty much what I talked about for question 1. I >>>>>>>> think that we have a good plan around using sequence numbers, if we >>>>>>>> want to >>>>>>>> do this. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Ryan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hi everyone, >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Based on the conversation in the last community sync and the >>>>>>>> Iceberg Slack channel, it seems like multiple parties have interest in >>>>>>>> continuing the effort related to the secondary index in Iceberg, so I >>>>>>>> would >>>>>>>> like to restart the thread to continue the discussion. 
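[Editor's note] Ryan's point earlier in the thread, that bloom filter size scales with the number of distinct values and would likely exceed what you would want to inline in a manifest binary column, can be made concrete with the standard optimal-size formula. The numbers below are illustrative, not from the thread.

```python
import math

def bloom_filter_size_bytes(n_distinct, fpp):
    """Optimal bloom filter size in bytes: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_distinct * math.log(fpp) / (math.log(2) ** 2)
    return bits / 8

# e.g. a partition-level filter over 100M distinct identifiers at 1% FPP
size = bloom_filter_size_bytes(100_000_000, 0.01)
print(f"{size / 1024**2:.0f} MiB")  # roughly 114 MiB -- far too large for a manifest column
```

A per-file filter over, say, 1M distinct values still lands around 1 MiB, which supports the point that separate index files (options 2 and 3 above) are the more plausible home for bloom filters.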
>>>>>>>> >>>>>>>> >>>>>>>> So far most people refer to the document authored by Miao Wang >>>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit> >>>>>>>> which has a lot of useful information about the design and implementation. >>>>>>>> However, the document is also quite old (over a year now) and a lot has >>>>>>>> changed in Iceberg since then. I think the document leaves the following >>>>>>>> open topics that we need to continue to address: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 1. *scope of native index support*: what types of index should >>>>>>>> Iceberg support natively, and how should developers allocate effort between >>>>>>>> adding Iceberg-native index support and developing Iceberg support >>>>>>>> for holistic indexing projects such as HyperSpace >>>>>>>> <https://microsoft.github.io/hyperspace/>. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 2. *index levels*: we have talked about partition-level indexing >>>>>>>> and file-level indexing. More clarity is needed on these index levels and >>>>>>>> the level of interest and support needed for those different indexing >>>>>>>> levels. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 3. 
*index storage*: we had unsettled debates around making indexes >>>>>>>> separate files or embedding them as a part of the existing Iceberg file >>>>>>>> structure. We need to come up with certain criteria, such as index size, >>>>>>>> ease of generation during write, etc., to settle the discussion. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could >>>>>>>> be created during the data writing process synchronously, or built >>>>>>>> asynchronously through an index service. Discussion is needed on the focus >>>>>>>> of the Iceberg index functionalities. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> 5. *index invalidation*: depending on the scope and level, certain >>>>>>>> indexes need to be invalidated during operations like RewriteFiles. Clarity >>>>>>>> is needed in this domain, including whether we need another sequence number to >>>>>>>> track such invalidation. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I suggest we iterate a bit on this list of open questions; then >>>>>>>> we can have a meeting to discuss those aspects and produce an updated >>>>>>>> document addressing them, to provide a clear path forward for >>>>>>>> developers interested in adding features in this domain. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Any thoughts? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jack Ye >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> Ryan Blue >>>>>>>> >>>>>>>> Tabular >>>>>>>
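[Editor's note] As a footnote to open topic 5 and Ryan's earlier sequence-number example (an index current as of sequence number 11 while reading sequence number 12), the validity check at scan planning could be sketched roughly as below. Names are illustrative; this is not an existing Iceberg API.

```python
def plan_scan(files, index_valid_as_of):
    """Split a scan into files the index covers and files that must be read in
    full: an index current as of a given sequence number covers every file at
    that sequence number or below; anything newer is scanned without it."""
    covered = [path for path, seq in files if seq <= index_valid_as_of]
    uncovered = [path for path, seq in files if seq > index_valid_as_of]
    return covered, uncovered

# Ryan's example: index up to date as of seq 11, table now at seq 12.
covered, uncovered = plan_scan(
    [("old.parquet", 11), ("new.parquet", 12)], index_valid_as_of=11
)
assert covered == ["old.parquet"] and uncovered == ["new.parquet"]
```

This only works while sequence numbers stay monotonic per file, which is exactly the property compaction breaks and the proposed watermark would restore.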