Re: [External] Re: Continuing the Secondary Index Discussion

Zaicheng Wang Sat, 05 Mar 2022 00:33:19 -0800

Hi dev folks,

As discussed in the sync
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>
meeting, we will have a dedicated meeting on this topic.
I have tentatively scheduled it for 4 PM PST on March 8th. The meeting
link is https://meet.google.com/ttd-jzid-abp
Please let me know if the time does not work for you.


Thanks,
Zaicheng

zaicheng wang <wangzaich...@bytedance.com> wrote on Wed, Mar 2, 2022 at 21:17:

> Hi folks,
>
> This is Zaicheng from ByteDance. We have spent some time working on the
> index invalidation problem that we discussed in the dev email channel.
> While working on the POC, we also realized that some metadata changes
> might need to be introduced.
> We put these details into a document:
>
> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
> The document includes two proposals for solving the index invalidation
> problem: one based on @Jack Ye’s idea of introducing a new sequence number,
> and another that leverages the current manifest entry structure. The
> document also describes the corresponding table spec change.
> Please let me know if you have any thoughts. We could also discuss this
> during the sync meeting.
>
> Thanks,
> Zaicheng
>
> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>>
>> The goal here is to have a monotonically increasing number that could be
>> used to detect what files have been newly added and should be indexed. This
>> is especially important to know how up-to-date an index is for each
>> partition.
>>
>> In a table without compaction, the sequence numbers of files would
>> continue to increase. If we have indexed all files up to sequence number 3,
>> we know that the next indexing process needs to index all the files with a
>> sequence number greater than 3. But during compaction, files are rewritten
>> with the starting sequence number. By commit time the table's sequence
>> number might have already gone much higher. For example, I start compaction
>> at seq=3, and while it runs for a few hours, 10 inserts are made to the
>> table, so the current sequence number is 13. When I commit the compacted
>> data files, those files are essentially written at a sequence number older
>> than the latest. This breaks a lot of assumptions: (1) I cannot find new
>> data to index just by checking whether the sequence number is higher than a
>> certain value, and (2) a reader cannot determine whether an index can be
>> used based on the sequence number.
>>
>> The solution I was describing is to have another watermark that is
>> monotonically increasing regardless of whether compaction happens. So
>> compaction would commit those files at seq=3, but the new watermark of those
>> files would be 14. Then we can use this new watermark for all the index
>> operations.
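
A minimal Python sketch of the compaction example above, to make the difference between the two counters concrete. The `DataFile` class, the `watermark` field, and `files_to_index` are hypothetical names used only for illustration; they are not part of the Iceberg spec or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int  # sequence number the file is committed with
    watermark: int        # hypothetical counter that only ever increases


def files_to_index(files, indexed_up_to, key):
    """Return paths of files whose `key` value is above the last indexed value."""
    return [f.path for f in files if getattr(f, key) > indexed_up_to]


# Ten inserts land at seq 4..13 while a compaction that started at seq=3 runs.
inserts = [DataFile(f"insert-{s}.parquet", s, s) for s in range(4, 14)]
# The compacted file is committed at seq=3 but receives the next watermark, 14.
compacted = DataFile("compacted.parquet", sequence_number=3, watermark=14)
table_files = inserts + [compacted]

# An indexer that has already processed everything up to 13:
by_seq = files_to_index(table_files, 13, "sequence_number")
by_watermark = files_to_index(table_files, 13, "watermark")

print(by_seq)        # the compacted file is missed
print(by_watermark)  # the compacted file is picked up
```

With the sequence number alone the rewritten file is never seen as new; with the separate watermark it is.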
>>
>> Best,
>> Jack Ye
>>
>>
>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wcatp19891...@gmail.com>
>> wrote:
>>
>>> Hi Jack,
>>>
>>>
>>> Thanks for the summary; it helps me a lot.
>>>
>>> I am trying to understand point 2 and would like to add my 2 cents.
>>>
>>> *a mechanism for tracking file change is needed. Unfortunately sequence
>>> numbers cannot be used due to the introduction of compaction that rewrites
>>> files into a lower sequence number. Another monotonically increasing
>>> watermark for files has to be introduced for index change detection and
>>> invalidation.*
>>>
>>> Please let me know if I have some wrong/silly assumptions.
>>>
>>> So the *reason* we couldn't use sequence numbers as the validity
>>> indicator of the index is compaction. Before compaction (taking a very
>>> simple example), each data file maps to an index file, and
>>> tableScan.planTask() is able to decide whether to use the index purely by
>>> comparing sequence numbers (as well as the index spec id, if we have one).
>>>
>>> After compaction, tableScan.planTask() can no longer do so, because data
>>> file 5 is compacted into a new data file with seq = 10. Thus wrong plan
>>> tasks might be returned.
>>>
>>> I wonder how an additional watermark only for the index could solve the
>>> problem?
>>>
>>>
>>> My gut feeling is that we could somehow solve the problem with the
>>> current sequence number:
>>>
>>> *Option 1*: When compacting, we could put the data files whose index is
>>> up to date into one group and the files whose index is stale or missing
>>> into another group (just as we do today with data files that are
>>> unpartitioned or whose partition spec id does not match).
>>>
>>> The *pro* is that we could still leverage indexes for part of the data
>>> files, and we could reuse the sequence number.
>>>
>>> The *cons* are that the compaction might not reach the target size and
>>> we might still have small files.
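
A rough Python sketch of Option 1, assuming a file counts as "index up to date" when its sequence number is at or below the last indexed sequence number. The function name, the tuple layout, and the 512 MB target are illustrative assumptions, not existing Iceberg behavior.

```python
def plan_compaction_groups(files, indexed_up_to, target_mb):
    """Split compaction candidates by index freshness and report group sizes.

    `files` is a list of (path, sequence_number, size_mb) tuples.
    """
    groups = {"indexed": [], "unindexed": []}
    for path, seq, size_mb in files:
        key = "indexed" if seq <= indexed_up_to else "unindexed"
        groups[key].append((path, size_mb))

    report = {}
    for key, members in groups.items():
        total = sum(size for _, size in members)
        report[key] = {
            "files": [path for path, _ in members],
            "total_mb": total,
            # False illustrates the stated con: the group stays below target.
            "reaches_target": total >= target_mb,
        }
    return report


files = [
    ("a.parquet", 1, 200),
    ("b.parquet", 2, 200),
    ("c.parquet", 3, 200),
    ("d.parquet", 4, 100),
]
report = plan_compaction_groups(files, indexed_up_to=3, target_mb=512)
print(report)
```

In this sample the indexed group reaches the target size, while the single unindexed file stays small, which is the drawback listed under *cons*.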
>>>
>>> *Option 2*:
>>>
>>> Assume compaction is usually triggered by data engineers and is not
>>> very frequent. We could directly invalidate all index files for the
>>> compacted data files, and the user would need to rebuild the index every
>>> time after compaction.
>>>
>>> *Pro*: Easy to implement, clear to understand.
>>>
>>> *Cons*: Relatively bad user experience. Wastes some computing resources
>>> redoing work.
>>>
>>> *Option 3*:
>>>
>>> We could leverage the engine's computing resource to always rebuild
>>> indexes during data compaction.
>>>
>>> *Pro*: Users could leverage the index right after data compaction.
>>>
>>> *Cons*: Rebuilding might take more time/resources.
>>>
>>> *Option 3 alternative*: add a configuration property to compaction that
>>> controls whether the index is rebuilt during compaction.
>>>
>>>
>>> Please let me know if you have any thoughts on this.
>>>
>>> Best,
>>>
>>> Zaicheng
>>>
>>> Jack Ye <yezhao...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17:
>>>
>>>> Thanks for the fast responses!
>>>>
>>>> Based on the conversations above, it sounds like we have the following
>>>> consensus:
>>>>
>>>> 1. asynchronous index creation is preferred, although synchronous index
>>>> creation is possible.
>>>> 2. a mechanism for tracking file change is needed. Unfortunately
>>>> sequence number cannot be used due to the introduction of compaction that
>>>> rewrites files into a lower sequence number. Another monotonically
>>>> increasing watermark for files has to be introduced for index change
>>>> detection and invalidation.
>>>> 3. index creation and maintenance procedures should be pluggable by
>>>> different engines. This should not be an issue because Iceberg has been
>>>> designing action interfaces for different table maintenance procedures so
>>>> far, so what Zaicheng describes should be the natural development direction
>>>> once the work is started.
>>>>
>>>> Regarding index level, I also think partition level index is more
>>>> important, but it seems like we have to first do file level as the
>>>> foundation. This leads to the index storage part. I am not talking about
>>>> using Parquet to store it, I am asking about what Miao is describing. I
>>>> don't think we have a consensus around the exact place to store index
>>>> information yet. My memory is that there are 3 ways:
>>>> 1. file level index stored as a binary field in manifest, partition
>>>> level index stored as a binary field in manifest list. This would only work
>>>> for small size indexes like bitmap (or bloom filter to certain extent)
>>>> 2. some sort of binary file to store index data, and index metadata
>>>> (e.g. index type) and pointer to the binary index data file is kept in 1 (I
>>>> think this is what Miao is describing)
>>>> 3. some sort of index spec to independently store index metadata and
>>>> data, similar to what we are proposing today for view
>>>>
>>>> Another aspect of index storage is the index file location in case of 2
>>>> and 3. In the original doc a specific file path structure is proposed,
>>>> but this is a bit against the Iceberg standard of not assuming any file
>>>> path layout so that it works with any storage. We also need more clarity
>>>> on that topic.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wcatp19891...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for having the thread. This is Zaicheng from bytedance.
>>>>>
>>>>> Initially we were planning to add an index feature for our internal
>>>>> Trino, and we feel Iceberg could be the best place for holding/building
>>>>> the index data.
>>>>> We are very interested in having and contributing to this feature.
>>>>> (Pretty new to the community, still adding my 2 cents.)
>>>>>
>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide an
>>>>> interface for creating/updating/deleting indexes, and each engine can
>>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>>> manner, async or sync).
>>>>> Take our use case as an example: we plan to have a new DDL syntax
>>>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on table
>>>>> col_1", and our SQL engine will create distributed index creation/updating
>>>>> operators. Each operator will invoke the index-related methods provided by
>>>>> Iceberg.
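
A small Python sketch of what such an engine-facing interface could look like. All class and method names here (`IndexOperations`, `create_index`, `update_index`, `delete_index`) are hypothetical and only illustrate the split between a format-owned interface and engine-owned invocation; no such API exists in Iceberg as far as this thread states.

```python
from abc import ABC, abstractmethod


class IndexOperations(ABC):
    """Hypothetical interface the table format would expose to engines."""

    @abstractmethod
    def create_index(self, name, column, index_type): ...

    @abstractmethod
    def update_index(self, name): ...

    @abstractmethod
    def delete_index(self, name): ...


class InMemoryIndexOperations(IndexOperations):
    """Toy implementation that only tracks index metadata in a dict."""

    def __init__(self):
        self.indexes = {}

    def create_index(self, name, column, index_type):
        if name in self.indexes:
            raise ValueError(f"index {name!r} already exists")
        self.indexes[name] = {"column": column, "type": index_type, "version": 1}

    def update_index(self, name):
        self.indexes[name]["version"] += 1

    def delete_index(self, name):
        del self.indexes[name]


# An engine would map its DDL onto these calls, for example:
#   create index id_1 on table col_1 using bloom
ops = InMemoryIndexOperations()
ops.create_index("id_1", "col_1", "bloom")
#   update index id_1 on table col_1
ops.update_index("id_1")
print(ops.indexes)
```

Whether the calls run in a distributed operator or a single thread stays an engine decision; the interface itself does not care.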
>>>>>
>>>>> Storage): Does the index data have to be a file? I wonder if we want
>>>>> to design the index data storage interface in such a way that people can
>>>>> plug in different index storage (file storage/centralized index storage
>>>>> service) later on.
>>>>>
>>>>> Thanks,
>>>>> Zaicheng
>>>>>
>>>>>
>>>>> Miao Wang <miw...@adobe.com.invalid> wrote on Wed, Jan 26, 2022 at 10:22:
>>>>>
>>>>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance
>>>>>> created a Slack channel for index work. I suggested that he add Anton
>>>>>> and you to the channel.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I still remember some conclusions from previous discussions.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1). Index types support: We planned to support Skipping Index first.
>>>>>> Iceberg metadata exposes hints about whether the tracked data files have
>>>>>> an index, which reduces index reading overhead. Index files can be
>>>>>> applied when generating scan tasks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2). As Ryan mentioned, Sequence number will be used to indicate
>>>>>> whether an index is valid. Sequence number can link the data evolution
>>>>>> with index evolution.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3). Storage: We planned to have a simple file format which includes
>>>>>> Column Name/ID, Index Type (String), Index content length, and binary
>>>>>> content. It is not necessary to use Parquet to store the index. The
>>>>>> initial thought was 1 data file mapping to 1 index file. It can be merged
>>>>>> to 1 partition mapping to 1 index file. As Ryan said, the file level
>>>>>> implementation could be a stepping stone for the partition level
>>>>>> implementation.
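
A minimal Python sketch of one way the four fields listed in 3) could be laid out in a binary record. The field order, the big-endian integer widths, and the function names are assumptions made for illustration; the thread does not specify an actual byte layout.

```python
import struct

# Assumed layout (big-endian, hypothetical):
#   int32  column id
#   uint16 length of index type string, followed by its UTF-8 bytes
#   uint32 index content length, followed by the binary content
_HEADER = ">iH"


def encode_index_entry(column_id, index_type, content):
    type_bytes = index_type.encode("utf-8")
    header = struct.pack(_HEADER, column_id, len(type_bytes))
    return header + type_bytes + struct.pack(">I", len(content)) + content


def decode_index_entry(buf):
    column_id, type_len = struct.unpack_from(_HEADER, buf, 0)
    offset = struct.calcsize(_HEADER)
    index_type = buf[offset:offset + type_len].decode("utf-8")
    offset += type_len
    (content_len,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    content = buf[offset:offset + content_len]
    return column_id, index_type, content


record = encode_index_entry(3, "bloom", b"\x01\x02\x03")
decoded = decode_index_entry(record)
print(len(record), decoded)
```

Because the index type is a free-form string and the content is opaque bytes, new index types could be added without changing the container layout.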
>>>>>>
>>>>>>
>>>>>>
>>>>>> 4). How to build index: We want to keep the index reading and writing
>>>>>> interface with Iceberg and leave the actual building logic as Engine
>>>>>> specific (i.e., we can use different compute to build Index without
>>>>>> changing anything inside Iceberg).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Misc:
>>>>>>
>>>>>> Huaxin implemented Index support API for DSv2 in Spark 3.x code base.
>>>>>>
>>>>>> Design doc:
>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>
>>>>>> PR should have been merged.
>>>>>>
>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will ask
>>>>>> if he can make it public.
>>>>>>
>>>>>>
>>>>>>
>>>>>> We can continue the discussion and break down the big tasks into
>>>>>> tickets.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Miao
>>>>>>
>>>>>> *From: *Ryan Blue <b...@tabular.io>
>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>
>>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>>> start adding more indexes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Scope of native index support
>>>>>>
>>>>>>
>>>>>>
>>>>>> The way I think about it, the biggest challenge here is how to know
>>>>>> when you can use an index. For example, if you have a partition index
>>>>>> that is up to date as of snapshot 13764091836784, but the current
>>>>>> snapshot is 97613097151667, then you basically have no idea what files
>>>>>> are covered or not and can't use it. On the other hand, if you know that
>>>>>> the index was up to date as of sequence number 11 and you're reading
>>>>>> sequence number 12, then you just have to read any data file that was
>>>>>> written at sequence number 12.
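
A short Python sketch of the planning rule described above: files at or below the index's sequence number can be pruned with the index, and newer files are always read. The `plan_scan` function and the per-file value sets standing in for a skipping index are hypothetical simplifications.

```python
def plan_scan(files, index, index_seq, lookup):
    """Return the paths that must be read to find `lookup`.

    `files` is a list of (path, sequence_number) tuples.
    `index` maps a path to the set of values it contains (a stand-in for
    a real skipping index such as a bloom filter).
    """
    tasks = []
    for path, seq in files:
        covered = seq <= index_seq and path in index
        if covered:
            # The index is trustworthy for this file: skip it on a miss.
            if lookup in index[path]:
                tasks.append(path)
        else:
            # Written after the index was built: must be read.
            tasks.append(path)
    return tasks


files = [("f1.parquet", 10), ("f2.parquet", 11), ("f3.parquet", 12)]
index = {"f1.parquet": {"a", "b"}, "f2.parquet": {"c"}}

# Index is up to date as of sequence number 11; the table is at 12.
print(plan_scan(files, index, index_seq=11, lookup="c"))
```

Here `f1.parquet` is skipped by the index, `f2.parquet` is kept because the index reports a hit, and `f3.parquet` is read because it was written at sequence number 12.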
>>>>>>
>>>>>>
>>>>>>
>>>>>> The problem of where you can use an index makes me think that it is
>>>>>> best to maintain index metadata within Iceberg. An alternative is to try
>>>>>> to always keep the index up-to-date, but I don't think that's necessarily
>>>>>> possible -- you'd have to support index updates in every writer that
>>>>>> touches table data. You would have to spend the time updating indexes at
>>>>>> write time, but there are competing priorities like making data
>>>>>> available. So I think you want asynchronous index updates and that leads
>>>>>> to integration with the table format.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Index levels
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think that partition-level indexes are better for job planning
>>>>>> (eliminate whole partitions!) but file-level are still useful for
>>>>>> skipping files at the task level. I would probably focus on
>>>>>> partition-level, but I'm not strongly opinionated here. File-level is
>>>>>> probably a stepping stone to partition-level, given that we would be able
>>>>>> to track index data in the same format.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Index storage
>>>>>>
>>>>>>
>>>>>>
>>>>>> Do you mean putting indexes in Parquet, or using Parquet for indexes?
>>>>>> I think that bloom filters would probably exceed the amount of data we'd
>>>>>> want to put into a Parquet binary column, probably at the file level and
>>>>>> almost certainly at the partition level, since the size depends on the
>>>>>> number of distinct values and the primary use is for identifiers.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Indexing process
>>>>>>
>>>>>>
>>>>>>
>>>>>> Synchronous is nice, but as I said above, I think we have to support
>>>>>> async because it is too complicated to update every writer that touches a
>>>>>> table and you may not want to pay the price at write time.
>>>>>>
>>>>>>
>>>>>>
>>>>>> > Index validation
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think this is pretty much what I talked about for question 1. I
>>>>>> think that we have a good plan around using sequence numbers, if we want
>>>>>> to do this.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Based on the conversation in the last community sync and the Iceberg
>>>>>> Slack channel, it seems like multiple parties have interest in continuing
>>>>>> the effort related to the secondary index in Iceberg, so I would like to
>>>>>> restart the thread to continue the discussion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So far most people refer to the document authored by Miao Wang
>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>>> which has a lot of useful information about the design and
>>>>>> implementation. However, the document is also quite old (over a year now)
>>>>>> and a lot has changed in Iceberg since then. I think the document leaves
>>>>>> the following open topics that we need to continue to address:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1. *scope of native index support*: what type of index should
>>>>>> Iceberg support natively, how should developers allocate effort between
>>>>>> adding support of Iceberg native index compared to developing Iceberg
>>>>>> support for holistic indexing projects such as HyperSpace
>>>>>> <https://microsoft.github.io/hyperspace/>.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2. *index levels*: we have talked about partition level indexing and
>>>>>> file level indexing. More clarity is needed for these index levels and
>>>>>> the level of interest and support needed for those different indexing
>>>>>> levels.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3. *index storage*: we had unsettled debates around making indexes
>>>>>> separate files or embedding them as a part of the existing Iceberg file
>>>>>> structure. We need to come up with certain criteria such as index size,
>>>>>> ease of generation during write, etc. to settle the discussion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could
>>>>>> be created during the data writing process synchronously, or built
>>>>>> asynchronously through an index service. Discussion is needed for the
>>>>>> focus of the Iceberg index functionalities.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 5. *index invalidation*: depending on the scope and level, certain
>>>>>> indexes need to be invalidated during operations like RewriteFiles.
>>>>>> Clarity is needed in this domain, including whether we need another
>>>>>> sequence number to track such invalidation.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I suggest we iterate a bit on this list of open questions, and then
>>>>>> we can have a meeting to discuss those aspects, and produce an updated
>>>>>> document addressing those aspects to provide a clear path forward for
>>>>>> developers interested in adding features in this domain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Ryan Blue
>>>>>>
>>>>>> Tabular
>>>>>>
>>>>>
