[Data skipping Index to improve query performance] https://github.com/apache/hudi/blob/920f45926a3112b6d045ca3b434bc7c4e55e5e3c/rfc/rfc-27/rfc-27.md
Jack Ye <yezhao...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17:

> Thanks for the fast responses!
>
> Based on the conversations above, it sounds like we have the following consensus:
>
> 1. Asynchronous index creation is preferred, although synchronous index creation is possible.
> 2. A mechanism for tracking file changes is needed. Unfortunately, the sequence number cannot be used, because compaction rewrites files into a lower sequence number. Another monotonically increasing watermark for files has to be introduced for index change detection and invalidation.
> 3. Index creation and maintenance procedures should be pluggable by different engines. This should not be an issue, because Iceberg has been designing action interfaces for different table maintenance procedures so far, so what Zaicheng describes should be the natural development direction once the work is started.
>
> Regarding index level, I also think a partition-level index is more important, but it seems like we have to do file level first as the foundation.
>
> This leads to the index storage part. I am not talking about using Parquet to store it; I am asking about what Miao is describing. I don't think we have a consensus yet around the exact place to store index information. My memory is that there are 3 ways:
> 1. File-level index stored as a binary field in the manifest, partition-level index stored as a binary field in the manifest list. This would only work for small indexes like bitmaps (or bloom filters, to a certain extent).
> 2. Some sort of binary file to store the index data, with the index metadata (e.g. index type) and a pointer to the binary index data file kept as in 1 (I think this is what Miao is describing).
> 3. Some sort of index spec to independently store index metadata and data, similar to what we are proposing today for views.
>
> Another aspect of index storage is the index file location in case of 2 and 3.
> In the original doc a specific file path structure is proposed, whereas this is a bit against the Iceberg standard of not assuming file paths, so that any storage can be used. We also need more clarity on that topic.
>
> Best,
> Jack Ye
>
> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wcatp19891...@gmail.com> wrote:
>
>> Thanks for starting the thread. This is Zaicheng from ByteDance.
>>
>> Initially we were planning to add an index feature for our internal Trino, and we feel like Iceberg could be the best place for holding/building the index data. We are very interested in having and contributing to this feature. (Pretty new to the community, just adding my 2 cents.)
>>
>> Echoing what Miao mentioned in 4): I feel Iceberg could provide interfaces for creating/updating/deleting an index, and each engine can decide how to invoke these methods (in a distributed or single-threaded manner, async or sync). Take our use case as an example: we plan to add new DDL syntax "create index id_1 on table col_1 using bloom" / "update index id_1 on table col_1", and our SQL engine will create distributed index creation/updating operators. Each operator will invoke the index-related methods provided by Iceberg.
>>
>> Storage): Does the index data have to be a file? Wondering if we want to design the index data storage interface in such a way that people can plug in different index storage (file storage / centralized index storage service) later on.
>>
>> Thanks,
>> Zaicheng
>>
>> Miao Wang <miw...@adobe.com.invalid> wrote on Wed, Jan 26, 2022 at 10:22:
>>
>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance created a Slack channel for index work. I suggested that he add Anton and you to the channel.
>>>
>>> I still remember some conclusions from previous discussions.
>>>
>>> 1). Index types to support: We planned to support skipping indexes first.
>>> Iceberg metadata exposes hints about whether the tracked data files have an index, which reduces index reading overhead. The index file can be applied when generating the scan task.
>>>
>>> 2). As Ryan mentioned, the sequence number will be used to indicate whether an index is valid. The sequence number can link data evolution with index evolution.
>>>
>>> 3). Storage: We planned to have a simple file format which includes column name/ID, index type (string), index content length, and binary content. It is not necessary to use Parquet to store the index. The initial thought was 1 data file mapping to 1 index file. It can be merged to 1 partition mapping to 1 index file. As Ryan said, a file-level implementation could be a stepping stone for a partition-level implementation.
>>>
>>> 4). How to build the index: We want to keep the index reading and writing interfaces within Iceberg and leave the actual building logic as engine specific (i.e., we can use different compute to build the index without changing anything inside Iceberg).
>>>
>>> Misc:
>>> Huaxin implemented an index support API for DSv2 in the Spark 3.x code base.
>>> Design doc: https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>> The PR should have been merged.
>>> Guy from IBM did a partial PoC and provided a private doc. I will ask if he can make it public.
>>>
>>> We can continue the discussion and break down the big tasks into tickets.
>>>
>>> Thanks!
>>>
>>> Miao
>>>
>>> *From:* Ryan Blue <b...@tabular.io>
>>> *Date:* Tuesday, January 25, 2022 at 5:08 PM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Continuing the Secondary Index Discussion
>>>
>>> Thanks for raising this for discussion, Jack! It would be great to start adding more indexes.
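Miao's point 3) above describes a deliberately simple index file layout: column name/ID, index type (string), index content length, and the binary content itself. As a rough sketch only (the field widths, byte order, and function names below are assumptions for illustration, not anything Iceberg specifies), such a record could be framed like this:

```python
import struct

def write_index_blob(column_id: int, index_type: str, content: bytes) -> bytes:
    """Frame one index entry: column id, index type, content length, payload.

    Hypothetical layout: 4-byte column id, 2-byte type-name length, the
    UTF-8 type name, an 8-byte content length, then the raw index bytes.
    """
    type_bytes = index_type.encode("utf-8")
    return (
        struct.pack(">IH", column_id, len(type_bytes))
        + type_bytes
        + struct.pack(">Q", len(content))
        + content
    )

def read_index_blob(buf: bytes):
    """Inverse of write_index_blob: recover (column_id, index_type, content)."""
    column_id, type_len = struct.unpack_from(">IH", buf, 0)
    offset = 6 + type_len
    index_type = buf[6:offset].decode("utf-8")
    (content_len,) = struct.unpack_from(">Q", buf, offset)
    offset += 8
    return column_id, index_type, buf[offset:offset + content_len]
```

The explicit content-length prefix is what would let a reader skip over an index type it does not understand, which matters if index building stays engine-pluggable as point 4) suggests.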
>>> > Scope of native index support
>>>
>>> The way I think about it, the biggest challenge here is how to know when you can use an index. For example, if you have a partition index that is up to date as of snapshot 13764091836784, but the current snapshot is 97613097151667, then you basically have no idea which files are covered and can't use it. On the other hand, if you know that the index was up to date as of sequence number 11 and you're reading sequence number 12, then you just have to read any data file that was written at sequence number 12.
>>>
>>> The problem of knowing where you can use an index makes me think that it is best to maintain index metadata within Iceberg. An alternative is to try to always keep the index up to date, but I don't think that's necessarily possible -- you'd have to support index updates in every writer that touches table data. You would have to spend the time updating indexes at write time, but there are competing priorities like making data available. So I think you want asynchronous index updates, and that leads to integration with the table format.
>>>
>>> > Index levels
>>>
>>> I think that partition-level indexes are better for job planning (eliminate whole partitions!), but file-level indexes are still useful for skipping files at the task level. I would probably focus on partition-level, but I'm not strongly opinionated here. File-level is probably a stepping stone to partition-level, given that we would be able to track index data in the same format.
>>>
>>> > Index storage
>>>
>>> Do you mean putting indexes in Parquet, or using Parquet for indexes?
>>> I think that bloom filters would probably exceed the amount of data we'd want to put into a Parquet binary column, probably at the file level and almost certainly at the partition level, since the size depends on the number of distinct values and the primary use is for identifiers.
>>>
>>> > Indexing process
>>>
>>> Synchronous is nice, but as I said above, I think we have to support async, because it is too complicated to update every writer that touches a table and you may not want to pay the price at write time.
>>>
>>> > Index validation
>>>
>>> I think this is pretty much what I talked about for question 1. I think that we have a good plan around using sequence numbers, if we want to do this.
>>>
>>> Ryan
>>>
>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> Based on the conversation in the last community sync and the Iceberg Slack channel, it seems like multiple parties have an interest in continuing the effort related to the secondary index in Iceberg, so I would like to restart the thread to continue the discussion.
>>>
>>> So far most people refer to the document authored by Miao Wang <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>, which has a lot of useful information about the design and implementation. However, the document is also quite old (over a year now) and a lot has changed in Iceberg since then.
>>> I think the document leaves the following open topics that we need to continue to address:
>>>
>>> 1. *Scope of native index support*: what types of index should Iceberg support natively, and how should developers allocate effort between adding Iceberg-native index support versus developing Iceberg support for holistic indexing projects such as Hyperspace <https://microsoft.github.io/hyperspace/>.
>>>
>>> 2. *Index levels*: we have talked about partition-level indexing and file-level indexing. More clarity is needed on these index levels and on the level of interest and support needed for each.
>>>
>>> 3. *Index storage*: we had unsettled debates around making indexes separate files or embedding them as part of the existing Iceberg file structure. We need to come up with criteria, such as index size and ease of generation during writes, to settle the discussion.
>>>
>>> 4. *Indexing process*: as stated in Miao's document, indexes could be created synchronously during the data writing process, or built asynchronously through an index service. Discussion is needed on the focus of the Iceberg index functionality.
>>>
>>> 5. *Index invalidation*: depending on the scope and level, certain indexes need to be invalidated during operations like RewriteFiles. Clarity is needed in this domain, including whether we need another sequence number to track such invalidation.
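Ryan's sequence-number argument earlier in the thread (an index up to date as of sequence number 11 still covers everything except data files written at sequence number 12) amounts to a simple planning rule. A minimal sketch, with illustrative names only, not Iceberg API:

```python
def plan_scan(data_files, index_sequence_number):
    """Split data files by whether a secondary index can prune them.

    A file whose sequence number is greater than the index's last-indexed
    sequence number postdates the index, so it must be read regardless of
    what the index says; all other files are eligible for index-based
    skipping. data_files is a list of (path, sequence_number) pairs.
    """
    covered, uncovered = [], []
    for path, sequence_number in data_files:
        if sequence_number <= index_sequence_number:
            covered.append(path)
        else:
            uncovered.append(path)
    return covered, uncovered
```

With the thread's example, an index at sequence number 11 leaves only the sequence-number-12 file to be read unconditionally. This is also exactly the property Jack's consensus point 2 says compaction breaks: a rewritten file can receive a lower sequence number than the index watermark, which is why a separate monotonically increasing watermark was proposed for invalidation.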
>>> I suggest we iterate a bit on this list of open questions; then we can have a meeting to discuss these aspects and produce an updated document that addresses them, providing a clear path forward for developers interested in adding features in this domain.
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> Jack Ye
>>>
>>> --
>>> Ryan Blue
>>> Tabular
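The bloom filter that recurs throughout the thread (Zaicheng's "create index ... using bloom" DDL, Ryan's sizing concern) works as a skipping index because it can return false positives but never false negatives, so pruning a file on a negative lookup is always safe. A toy illustration of that property, assuming nothing about any engine's actual implementation (all names here are made up):

```python
import hashlib

class ToyBloom:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored as one big integer

    def _positions(self, value):
        # Derive num_hashes independent bit positions from a seeded hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

def files_to_scan(filters_by_file, lookup_value):
    """Keep only the files whose bloom filter might contain the value."""
    return [
        path
        for path, bloom in filters_by_file.items()
        if bloom.might_contain(lookup_value)
    ]
```

Ryan's sizing point follows from the same structure: the bit array must grow with the number of distinct values to keep the false-positive rate down, which is why a partition-level filter over identifiers can get too large to inline in a manifest binary column.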