Hi dev folks,

As discussed in the sync meeting <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>, we will have a dedicated meeting on this topic.
I have tentatively scheduled it for 4 PM PST on March 8th.
The meeting link is https://meet.google.com/ttd-jzid-abp
Please let me know if the time does not work for you.
Thanks,
Zaicheng

zaicheng wang <wangzaich...@bytedance.com> wrote on Wed, Mar 2, 2022 at 21:17:

> Hi folks,
>
> This is Zaicheng from ByteDance. We spent some time working on the index invalidation problem that we discussed on the dev mailing list. While working on the POC, we also realized that some metadata changes might need to be introduced.
> We put the details into a document:
> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
> The document includes two proposals for solving the index invalidation problem: one based on @Jack Ye's idea of introducing a new sequence number, and another that leverages the current manifest entry structure. The document also describes the corresponding table spec change.
> Please let me know if you have any thoughts. We could also discuss this during the sync meeting.
>
> Thanks,
> Zaicheng
>
> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>>
>> The goal here is to have a monotonically increasing number that could be used to detect what files have been newly added and should be indexed. This is especially important to know how up-to-date an index is for each partition.
>>
>> In a table without compaction, the sequence number of files would continue to increase. If we have indexed all files up to sequence number 3, we know that the next indexing process needs to index all the files with sequence number greater than 3. But during compaction, files will be rewritten with the starting sequence number. By commit time, the sequence number might already have gone much higher. For example, I start compaction at seq=3, and while it runs for a few hours, 10 inserts are made to the table, so the current sequence number is 13. When I commit the compacted data files, those files are essentially written to a sequence number older than the latest. This breaks a lot of assumptions: (1) I cannot find new data to index just by checking whether the sequence number is higher than a certain value, and (2) a reader cannot determine whether an index can be used based on the sequence number.
>>
>> The solution I was describing is to have another watermark that is monotonically increasing regardless of compaction. So compaction would commit those files at seq=3, but the new watermark of those files is 14. Then we can use this new watermark for all the index operations.
>>
>> Best,
>> Jack Ye
>>
>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wcatp19891...@gmail.com> wrote:
>>
>>> Hi Jack,
>>>
>>> Thanks for the summary; it helps me a lot.
>>>
>>> Trying to understand point 2 and adding my 2 cents.
>>>
>>> *a mechanism for tracking file change is needed. Unfortunately sequence numbers cannot be used due to the introduction of compaction that rewrites files into a lower sequence number. Another monotonically increasing watermark for files has to be introduced for index change detection and invalidation.*
>>>
>>> Please let me know if I have some wrong/silly assumptions.
>>>
>>> So the *reason* we couldn't use sequence numbers as the validity indicator of the index is compaction. Before compaction (taking a very simple example), the data file and index file should have a mapping, and tableScan.planTask() is able to decide whether to use the index purely by comparing sequence numbers (as well as index spec id, if we have one).
>>>
>>> After compaction, tableScan.planTask() couldn't do so because data file 5 is compacted to a new data file with seq = 10. Thus wrong plan tasks might be returned.
>>>
>>> I wonder how an additional watermark only for the index could solve the problem?
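
A minimal sketch of the scenario Jack describes (compaction starts at seq=3, ten inserts land, the compacted file commits afterwards), to make the difference between the two counters concrete. Everything here is an assumption for illustration: the `watermark` field, the `DataFile` class, and the `files_to_index` helper are hypothetical and are not part of the Iceberg API or spec.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int  # data sequence number; a compacted file keeps the old one
    watermark: int        # hypothetical counter that increases with every commit


def files_to_index(files: List[DataFile], indexed_up_to: int,
                   key: Callable[[DataFile], int]) -> List[str]:
    """Return paths of files whose counter is newer than the last indexed value."""
    return [f.path for f in files if key(f) > indexed_up_to]


# Ten inserts commit at sequence numbers 4..13 while compaction is running.
files = [DataFile(f"insert-{s}.parquet", sequence_number=s, watermark=s)
         for s in range(4, 14)]
# The compacted file commits last: it keeps seq=3 but gets watermark 14.
files.append(DataFile("compacted.parquet", sequence_number=3, watermark=14))

indexed_up_to = 13  # the index was built after the tenth insert

# Comparing sequence numbers misses the compacted file entirely.
by_seq = files_to_index(files, indexed_up_to, key=lambda f: f.sequence_number)
# Comparing the hypothetical watermark finds it.
by_watermark = files_to_index(files, indexed_up_to, key=lambda f: f.watermark)

print(by_seq)
print(by_watermark)
```

With the sequence number alone, the indexer sees nothing new after 13 even though a file it has never indexed was just committed; the separate counter is what exposes that file.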
>>>
>>> And based on my gut feeling, I feel we could somehow solve the problem with the current sequence number:
>>>
>>> *Option 1*: When compacting, we could put the data files whose index is up to date into one group, and the files whose index is stale or missing into another group (just like what we do with data files that are unpartitioned or whose partition spec id does not match).
>>>
>>> The *pro* is that we could still leverage indexes for part of the data files, and we could reuse the sequence number.
>>>
>>> The *con* is that the compaction might not reach the target size and we might still have small files.
>>>
>>> *Option 2*: Assume compaction is often triggered by data engineers and is not so frequent. We could directly invalidate all index files for the compacted data files, and the user needs to rebuild the index after every compaction.
>>>
>>> *Pro*: Easy to implement, clear to understand.
>>>
>>> *Cons*: Relatively bad user experience. Wastes some computing resources redoing work.
>>>
>>> *Option 3*: We could leverage the engine's computing resources to always rebuild indexes during data compaction.
>>>
>>> *Pro*: Users could leverage the index after data compaction.
>>>
>>> *Cons*: Rebuilding might take more time and resources.
>>>
>>> *Option 3 alternative*: add a configuration property to compaction that controls whether the index is rebuilt during compaction.
>>>
>>> Please let me know if you have any thoughts on this.
>>>
>>> Best,
>>>
>>> Zaicheng
>>>
>>> Jack Ye <yezhao...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17:
>>>
>>>> Thanks for the fast responses!
>>>>
>>>> Based on the conversations above, it sounds like we have the following consensus:
>>>>
>>>> 1. asynchronous index creation is preferred, although synchronous index creation is possible.
>>>> 2. a mechanism for tracking file change is needed. Unfortunately sequence numbers cannot be used due to the introduction of compaction that rewrites files into a lower sequence number. Another monotonically increasing watermark for files has to be introduced for index change detection and invalidation.
>>>> 3. index creation and maintenance procedures should be pluggable by different engines. This should not be an issue because Iceberg has been designing action interfaces for different table maintenance procedures so far, so what Zaicheng describes should be the natural development direction once the work is started.
>>>>
>>>> Regarding index level, I also think partition level index is more important, but it seems like we have to first do file level as the foundation. This leads to the index storage part. I am not talking about using Parquet to store it, I am asking about what Miao is describing. I don't think we have a consensus around the exact place to store index information yet. My memory is that there are 3 ways:
>>>> 1. file level index stored as a binary field in manifest, partition level index stored as a binary field in manifest list. This would only work for small indexes like bitmap (or bloom filter to a certain extent)
>>>> 2. some sort of binary file to store index data, with index metadata (e.g. index type) and a pointer to the binary index data file kept in 1 (I think this is what Miao is describing)
>>>> 3. some sort of index spec to independently store index metadata and data, similar to what we are proposing today for view
>>>>
>>>> Another aspect of index storage is the index file location in case of 2 and 3. In the original doc a specific file path structure is proposed, whereas this is a bit against the Iceberg standard of not assuming file path to work with any storage. We also need more clarity on that topic.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wcatp19891...@gmail.com> wrote:
>>>>
>>>>> Thanks for having the thread. This is Zaicheng from ByteDance.
>>>>>
>>>>> Initially we are planning to add an index feature for our internal Trino, and we feel Iceberg could be the best place for holding/building the index data.
>>>>> We are very interested in having and contributing to this feature. (Pretty new to the community, still adding my 2 cents.)
>>>>>
>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide an interface for creating/updating/deleting indexes, and each engine can decide how to invoke these methods (in a distributed or single-threaded manner, async or sync).
>>>>> Take our use case as an example: we plan to have a new DDL syntax "create index id_1 on table col_1 using bloom"/"update index id_1 on table col_1", and our SQL engine will create distributed index creation/updating operators. Each operator will invoke the index-related methods provided by Iceberg.
>>>>>
>>>>> Storage): Does the index data have to be a file? Wondering if we want to design the index data storage interface in such a way that people can plug in different index storage (file storage/centralized index storage service) later on.
>>>>>
>>>>> Thanks,
>>>>> Zaicheng
>>>>>
>>>>> Miao Wang <miw...@adobe.com.invalid> wrote on Wed, Jan 26, 2022 at 10:22:
>>>>>
>>>>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance created a Slack channel for index work. I suggested that he add Anton and you to the channel.
>>>>>>
>>>>>> I still remember some conclusions from previous discussions.
>>>>>>
>>>>>> 1). Index types support: We planned to support Skipping Index first. Iceberg metadata exposes hints about whether the tracked data files have an index, which reduces index reading overhead. The index file can be applied when generating the scan task.
>>>>>>
>>>>>> 2). As Ryan mentioned, the sequence number will be used to indicate whether an index is valid. The sequence number can link the data evolution with index evolution.
>>>>>>
>>>>>> 3). Storage: We planned to have a simple file format which includes Column Name/ID, Index Type (String), Index content length, and binary content. It is not necessary to use Parquet to store the index. The initial thought was 1 data file mapping to 1 index file. It can be merged to 1 partition mapping to 1 index file. As Ryan said, a file level implementation could be a stepping stone for a partition level implementation.
>>>>>>
>>>>>> 4). How to build index: We want to keep the index reading and writing interface with Iceberg and leave the actual building logic engine specific (i.e., we can use different compute to build the index without changing anything inside Iceberg).
>>>>>>
>>>>>> Misc:
>>>>>>
>>>>>> Huaxin implemented Index support API for DSv2 in Spark 3.x code base.
>>>>>>
>>>>>> Design doc:
>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>
>>>>>> PR should have been merged.
>>>>>>
>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will ask if he can make it public.
>>>>>>
>>>>>> We can continue the discussion and break down the big tasks into tickets.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Miao
>>>>>>
>>>>>> *From: *Ryan Blue <b...@tabular.io>
>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>
>>>>>> Thanks for raising this for discussion, Jack! It would be great to start adding more indexes.
>>>>>>
>>>>>> > Scope of native index support
>>>>>>
>>>>>> The way I think about it, the biggest challenge here is how to know when you can use an index. For example, if you have a partition index that is up to date as of snapshot 13764091836784, but the current snapshot is 97613097151667, then you basically have no idea what files are covered or not and can't use it. On the other hand, if you know that the index was up to date as of sequence number 11 and you're reading sequence number 12, then you just have to read any data file that was written at sequence number 12.
>>>>>>
>>>>>> The problem of where you can use an index makes me think that it is best to maintain index metadata within Iceberg. An alternative is to try to always keep the index up-to-date, but I don't think that's necessarily possible -- you'd have to support index updates in every writer that touches table data. You would have to spend the time updating indexes at write time, but there are competing priorities like making data available. So I think you want asynchronous index updates and that leads to integration with the table format.
>>>>>>
>>>>>> > Index levels
>>>>>>
>>>>>> I think that partition-level indexes are better for job planning (eliminate whole partitions!) but file-level are still useful for skipping files at the task level. I would probably focus on partition-level, but I'm not strongly opinionated here. File-level is probably a stepping stone to partition-level, given that we would be able to track index data in the same format.
>>>>>>
>>>>>> > Index storage
>>>>>>
>>>>>> Do you mean putting indexes in Parquet, or using Parquet for indexes? I think that bloom filters would probably exceed the amount of data we'd want to put into a Parquet binary column, probably at the file level and almost certainly at the partition level, since the size depends on the number of distinct values and the primary use is for identifiers.
>>>>>>
>>>>>> > Indexing process
>>>>>>
>>>>>> Synchronous is nice, but as I said above, I think we have to support async because it is too complicated to update every writer that touches a table and you may not want to pay the price at write time.
>>>>>>
>>>>>> > Index validation
>>>>>>
>>>>>> I think this is pretty much what I talked about for question 1. I think that we have a good plan around using sequence numbers, if we want to do this.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Based on the conversation in the last community sync and the Iceberg Slack channel, it seems like multiple parties have interest in continuing the effort related to the secondary index in Iceberg, so I would like to restart the thread to continue the discussion.
>>>>>>
>>>>>> So far most people refer to the document authored by Miao Wang <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ%2Fedit&data=04%7C01%7Cmiwang%40adobe.com%7Cf818943b13944011e28f08d9e0684690%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637787561291307113%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=F0Utme%2BkWNf68olRifjD%2BE%2FXN1vxkIaY%2F7v8Meiz1N4%3D&reserved=0> which has a lot of useful information about the design and implementation.
>>>>>> However, the document is also quite old (over a year now) and a lot has changed in Iceberg since then. I think the document leaves the following open topics that we need to continue to address:
>>>>>>
>>>>>> 1. *scope of native index support*: what type of index should Iceberg support natively, and how should developers allocate effort between adding support for Iceberg native indexes and developing Iceberg support for holistic indexing projects such as HyperSpace <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmicrosoft.github.io%2Fhyperspace%2F&data=04%7C01%7Cmiwang%40adobe.com%7Cf818943b13944011e28f08d9e0684690%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637787561291307113%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=Jwlm%2Bp4hzbKQZj%2B3NKq%2BHMk42DnjJ2lMmF2WPNtWm90%3D&reserved=0>.
>>>>>>
>>>>>> 2. *index levels*: we have talked about partition level indexing and file level indexing. More clarity is needed for these index levels and the level of interest and support needed for those different indexing levels.
>>>>>>
>>>>>> 3. *index storage*: we had unsettled debates around making the index separate files or embedding it as a part of the existing Iceberg file structure. We need to come up with certain criteria such as index size, ease of generation during write, etc. to settle the discussion.
>>>>>>
>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could be created during the data writing process synchronously, or built asynchronously through an index service. Discussion is needed for the focus of the Iceberg index functionalities.
>>>>>>
>>>>>> 5. *index invalidation*: depending on the scope and level, certain indexes need to be invalidated during operations like RewriteFiles. Clarity is needed in this domain, including whether we need another sequence number to track such invalidation.
>>>>>>
>>>>>> I suggest we iterate a bit on this list of open questions, and then we can have a meeting to discuss those aspects, and produce an updated document addressing those aspects to provide a clear path forward for developers interested in adding features in this domain.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jack Ye
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Ryan Blue
>>>>>>
>>>>>> Tabular