Hi Zaicheng,

Thanks for following up on this. I'm certainly interested.
The proposed time doesn't work for me, though, as I'm in the CET time zone.


On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang <wcatp19891...@gmail.com>
wrote:

> Hi dev folks,
> As discussed in the sync
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>
> meeting, we will have a dedicated meeting on this topic.
> I have tentatively scheduled a meeting for 4 PM PST on March 8th. The
> meeting link is https://meet.google.com/ttd-jzid-abp
> Please let me know if the time does not work for you.
> Thanks,
> Zaicheng
> On Wed, Mar 2, 2022 at 9:17 PM zaicheng wang <wangzaich...@bytedance.com> wrote:
>> Hi folks,
>> This is Zaicheng from ByteDance. We spent some time working on the index
>> invalidation problem that we discussed on the dev mailing list. While
>> working on the POC, we also realized that it might introduce some
>> metadata changes.
>> We put these details into a document:
>> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
>> The document includes two proposals for solving the index invalidation
>> problem: one based on @Jack Ye’s idea of introducing a new sequence number,
>> and another that leverages the current manifest entry structure. The
>> document also describes the corresponding table spec changes.
>> Please let me know if you have any thoughts. We could also discuss this
>> during the sync meeting.
>> Thanks,
>> Zaicheng
>> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <yezhao...@gmail.com> wrote:
>>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in Slack.
>>> The goal here is to have a monotonically increasing number that could be
>>> used to detect what files have been newly added and should be indexed. This
>>> is especially important to know how up-to-date an index is for each
>>> partition.
>>> In a table without compaction, sequence number of files would continue
>>> to increase. If we have indexed all files up to sequence number 3, we know
>>> that the next indexing process needs to index all the files with sequence
>>> number greater than 3. But during compaction, files will be rewritten with
>>> the starting sequence number. At commit time the sequence number might
>>> already have gone much higher. For example, I start compaction at seq=3,
>>> and while it runs for a few hours, 10 inserts are made to the table, so
>>> the current sequence number is 13. When I commit the compacted data files,
>>> those files are essentially written to a sequence number older than the
>>> latest. This breaks a lot of assumptions: (1) I cannot find new data to
>>> index just by checking whether the sequence number is higher than a
>>> certain value, and (2) a reader cannot determine whether an index can be
>>> used based on the sequence number.
>>> The solution I was describing is to have another watermark that is
>>> monotonically increasing regardless of compaction. Compaction would
>>> commit those files at seq=3, but the new watermark of those files would
>>> be 14. Then we can use this new watermark for all the index operations.
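
A minimal sketch of the scenario Jack describes above, to make the difference between the two counters concrete. The `DataFile` class, the `watermark` field, and `files_to_index` are hypothetical names for illustration only; they are not part of the Iceberg API or spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int  # sequence number the file is committed at
    watermark: int        # hypothetical counter that only ever increases


def files_to_index(files, indexed_up_to, key):
    """Return paths whose `key` attribute is greater than the last indexed value."""
    return [f.path for f in files if getattr(f, key) > indexed_up_to]


# Ten inserts land at seq 4..13 while a compaction that started at seq=3 runs.
files = [DataFile(f"insert-{s}.parquet", s, s) for s in range(4, 14)]
# The compacted file commits at the old sequence number but gets watermark 14.
files.append(DataFile("compacted.parquet", sequence_number=3, watermark=14))

# An indexer that has already processed everything up to 13 runs again.
by_sequence = files_to_index(files, 13, "sequence_number")
by_watermark = files_to_index(files, 13, "watermark")

print(by_sequence)   # the compacted file is missed
print(by_watermark)  # the compacted file is picked up
```

Filtering on the sequence number misses the rewritten file, while filtering on the separate watermark finds it, which is the property the proposal relies on.
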
>>> Best,
>>> Jack Ye
>>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wcatp19891...@gmail.com>
>>> wrote:
>>>> Hi Jack,
>>>> Thanks for the summary; it helps me a lot.
>>>> I'm trying to understand point 2 and would like to add my 2 cents.
>>>> *a mechanism for tracking file change is needed. Unfortunately sequence
>>>> numbers cannot be used due to the introduction of compaction that rewrites
>>>> files into a lower sequence number. Another monotonically increasing
>>>> watermark for files has to be introduced for index change detection and
>>>> invalidation.*
>>>> Please let me know if I have some wrong/silly assumptions.
>>>> So the *reason* we couldn't use sequence numbers as the validity
>>>> indicator of the index is compaction. Before compaction (taking a very
>>>> simple example), the data file and index file should have a mapping, and
>>>> tableScan.planTask() is able to decide whether to use the index purely by
>>>> comparing sequence numbers (as well as the index spec id, if we have one).
>>>> After compaction, tableScan.planTask() can no longer do so because data
>>>> file 5 is compacted to a new data file with seq = 10. Thus wrong plan tasks
>>>> might be returned.
>>>> I wonder how an additional watermark only for the index could solve the
>>>> problem?
>>>> And based on my gut feeling, I feel we could somehow solve the problem
>>>> with the current sequence number:
>>>> *Option 1*: When compacting, we could put the data files whose index is
>>>> up to date into one group, and the files whose index is stale or does not
>>>> exist into another group. (Just like what we do with data files that are
>>>> unpartitioned or whose partition spec id does not match.)
>>>> The *pro* is that we could still leverage indexes for part of the data
>>>> files, and we could reuse the sequence number.
>>>> The *con* is that the compaction might not reach the target size and
>>>> we might still have small files.
>>>> *Option 2*:
>>>> Assume compaction is usually triggered by data engineers and is not very
>>>> frequent. We could directly invalidate all index files for the compacted
>>>> data files. The user would then need to rebuild the index every time
>>>> after compaction.
>>>> *Pro*: Easy to implement, clear to understand.
>>>> *Cons*: Relatively bad user experience. Wastes some computing resources
>>>> redoing work.
>>>> *Option 3*:
>>>> We could leverage the engine's computing resources to always rebuild
>>>> indexes during data compaction.
>>>> *Pro*: Users could leverage the index right after data compaction.
>>>> *Cons*: Rebuilding might take more time/resources.
>>>> *Option 3 alternative*: add a configuration property to compaction that
>>>> controls whether the user wants to rebuild the index during compaction.
>>>> Please let me know if you have any thoughts on this.
>>>> Best,
>>>> Zaicheng
>>>> On Wed, Jan 26, 2022 at 1:17 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>> Thanks for the fast responses!
>>>>> Based on the conversations above, it sounds like we have the following
>>>>> consensus:
>>>>> 1. asynchronous index creation is preferred, although synchronous
>>>>> index creation is possible.
>>>>> 2. a mechanism for tracking file change is needed. Unfortunately
>>>>> sequence number cannot be used due to the introduction of compaction that
>>>>> rewrites files into a lower sequence number. Another monotonically
>>>>> increasing watermark for files has to be introduced for index change
>>>>> detection and invalidation.
>>>>> 3. index creation and maintenance procedures should be pluggable by
>>>>> different engines. This should not be an issue because Iceberg has been
>>>>> designing action interfaces for different table maintenance procedures so
>>>>> far, so what Zaicheng describes should be the natural development
>>>>> direction once the work is started.
>>>>> Regarding index level, I also think partition level index is more
>>>>> important, but it seems like we have to first do file level as the
>>>>> foundation. This leads to the index storage part. I am not talking about
>>>>> using Parquet to store it, I am asking about what Miao is describing. I
>>>>> don't think we have a consensus around the exact place to store index
>>>>> information yet. My memory is that there are 3 ways:
>>>>> 1. file level index stored as a binary field in manifest, partition
>>>>> level index stored as a binary field in manifest list. This would only
>>>>> work for small indexes like bitmap (or bloom filter to a certain extent)
>>>>> 2. some sort of binary file to store index data, with index metadata
>>>>> (e.g. index type) and a pointer to the binary index data file kept as
>>>>> in 1 (I think this is what Miao is describing)
>>>>> 3. some sort of index spec to independently store index metadata and
>>>>> data, similar to what we are proposing today for views
>>>>> Another aspect of index storage is the index file location in the case
>>>>> of 2 and 3. The original doc proposes a specific file path structure,
>>>>> which goes a bit against the Iceberg standard of not assuming file
>>>>> paths, so that it works with any storage. We also need more clarity on
>>>>> that topic.
>>>>> Best,
>>>>> Jack Ye
>>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wcatp19891...@gmail.com>
>>>>> wrote:
>>>>>> Thanks for starting the thread. This is Zaicheng from ByteDance.
>>>>>> Initially we were planning to add an index feature for our internal
>>>>>> Trino, and we feel that Iceberg could be the best place for
>>>>>> holding/building the index data.
>>>>>> We are very interested in having and contributing to this feature.
>>>>>> (Pretty new to the community, still having my 2 cents)
>>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide
>>>>>> interfaces for creating/updating/deleting indexes, and each engine can
>>>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>>>> manner, async or sync).
>>>>>> Taking our use case as an example, we plan to have a new DDL syntax
>>>>>> "create index id_1 on table col_1 using bloom"/"update index id_1 on
>>>>>> table col_1", and our SQL engine will create distributed index
>>>>>> creation/updating operators. Each operator will invoke the
>>>>>> index-related methods provided by Iceberg.
>>>>>> Storage): Does the index data have to be a file? I wonder if we want
>>>>>> to design the index data storage interface in such a way that people
>>>>>> can plug in different index storage (file storage/centralized index
>>>>>> storage service) later on.
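
A rough sketch of what such a pluggable storage interface could look like, assuming a simple key of (index name, data file path). `IndexStorage` and `InMemoryIndexStorage` are hypothetical names made up for this illustration; nothing like them exists in Iceberg today.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class IndexStorage(ABC):
    """Hypothetical contract that a file-based or service-based store implements."""

    @abstractmethod
    def write(self, index_name: str, data_file: str, payload: bytes) -> None:
        ...

    @abstractmethod
    def read(self, index_name: str, data_file: str) -> Optional[bytes]:
        ...

    @abstractmethod
    def delete(self, index_name: str, data_file: str) -> None:
        ...


class InMemoryIndexStorage(IndexStorage):
    """Stand-in for a centralized index service; keeps payloads in a dict."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], bytes] = {}

    def write(self, index_name: str, data_file: str, payload: bytes) -> None:
        self._entries[(index_name, data_file)] = payload

    def read(self, index_name: str, data_file: str) -> Optional[bytes]:
        return self._entries.get((index_name, data_file))

    def delete(self, index_name: str, data_file: str) -> None:
        self._entries.pop((index_name, data_file), None)


storage: IndexStorage = InMemoryIndexStorage()
storage.write("id_1", "data/file-a.parquet", b"bloom-bytes")
```

An engine would only depend on the abstract class, so a file-backed implementation could be swapped in without changing the index build operators.
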
>>>>>> Thanks,
>>>>>> Zaicheng
>>>>>> On Wed, Jan 26, 2022 at 10:22 AM Miao Wang <miw...@adobe.com.invalid> wrote:
>>>>>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance
>>>>>>> created a Slack channel for index work. I suggested that he add Anton
>>>>>>> and you to the channel.
>>>>>>> I still remember some conclusions from previous discussions.
>>>>>>> 1). Index type support: We planned to support Skipping Index first.
>>>>>>> Iceberg metadata exposes hints on whether the tracked data files have
>>>>>>> an index, which reduces index reading overhead. The index file can be
>>>>>>> applied when generating the scan task.
>>>>>>> 2). As Ryan mentioned, the sequence number will be used to indicate
>>>>>>> whether an index is valid. The sequence number can link data evolution
>>>>>>> with index evolution.
>>>>>>> 3). Storage: We planned to have a simple file format that includes
>>>>>>> Column Name/ID, Index Type (String), Index content length, and binary
>>>>>>> content. It is not necessary to use Parquet to store the index. The
>>>>>>> initial thought was 1 data file mapping to 1 index file. It can be
>>>>>>> merged to 1 partition mapping to 1 index file. As Ryan said, the file
>>>>>>> level implementation could be a stepping stone for the partition level
>>>>>>> implementation.
>>>>>>> 4). How to build the index: We want to keep the index reading and
>>>>>>> writing interface within Iceberg and leave the actual building logic
>>>>>>> engine-specific (i.e., we can use different compute to build the index
>>>>>>> without changing anything inside Iceberg).
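
To illustrate the simple file format mentioned in point 3) above, here is a sketch that serializes one index entry with the four listed fields. The byte layout (big-endian, 4-byte column ID, 2-byte type length, 4-byte content length) and the function names are assumptions made for this example; the thread does not define a layout.

```python
import struct

HEADER = ">iH"  # assumed layout: int32 column ID, uint16 length of the type string


def encode_index_entry(column_id: int, index_type: str, content: bytes) -> bytes:
    """Pack column ID, index type, content length, and binary content."""
    type_bytes = index_type.encode("utf-8")
    header = struct.pack(HEADER, column_id, len(type_bytes))
    return header + type_bytes + struct.pack(">I", len(content)) + content


def decode_index_entry(buf: bytes):
    """Inverse of encode_index_entry; returns (column_id, index_type, content)."""
    column_id, type_len = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    index_type = buf[offset:offset + type_len].decode("utf-8")
    offset += type_len
    (content_len,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    return column_id, index_type, buf[offset:offset + content_len]


entry = encode_index_entry(1, "bloom", b"\x01\x02\x03")
print(decode_index_entry(entry))
```

Storing the content length explicitly lets a reader skip over an entry without interpreting the index-type-specific bytes.
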
>>>>>>> Misc:
>>>>>>> Huaxin implemented Index support API for DSv2 in Spark 3.x code
>>>>>>> base.
>>>>>>> Design doc:
>>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>> PR should have been merged.
>>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will
>>>>>>> ask if he can make it public.
>>>>>>> We can continue the discussion and break down the big tasks into
>>>>>>> tickets.
>>>>>>> Thanks!
>>>>>>> Miao
>>>>>>> *From: *Ryan Blue <b...@tabular.io>
>>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>>>> start adding more indexes.
>>>>>>> > Scope of native index support
>>>>>>> The way I think about it, the biggest challenge here is how to know
>>>>>>> when you can use an index. For example, if you have a partition index
>>>>>>> that is up to date as of snapshot 13764091836784, but the current
>>>>>>> snapshot is 97613097151667, then you basically have no idea what files
>>>>>>> are covered or not and can't use it. On the other hand, if you know that
>>>>>>> the index was up to date as of sequence number 11 and you're reading
>>>>>>> sequence number 12, then you just have to read any data file that was
>>>>>>> written at sequence number 12.
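
The sequence-number comparison Ryan describes can be sketched as follows. `split_scan` and the file names are hypothetical and only illustrate the idea; this is not how Iceberg scan planning is implemented.

```python
def split_scan(file_sequence_numbers, index_sequence_number):
    """Split files into those covered by the index and those that must be read directly.

    file_sequence_numbers maps a data file path to the sequence number it was
    written at; index_sequence_number is the point the index is up to date as of.
    """
    covered, uncovered = [], []
    for path, seq in file_sequence_numbers.items():
        if seq <= index_sequence_number:
            covered.append(path)
        else:
            uncovered.append(path)
    return sorted(covered), sorted(uncovered)


# Index is up to date as of sequence number 11; the reader is at sequence number 12.
table_files = {"a.parquet": 10, "b.parquet": 11, "c.parquet": 12}
covered, uncovered = split_scan(table_files, index_sequence_number=11)
print(covered, uncovered)
```

Only the file written at sequence number 12 has to be read without the index. This assumes sequence numbers only grow for new data, which is the assumption that compaction later turns out to break in this thread.
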
>>>>>>> The problem of where you can use an index makes me think that it is
>>>>>>> best to maintain index metadata within Iceberg. An alternative is to
>>>>>>> try to always keep the index up-to-date, but I don't think that's
>>>>>>> necessarily possible -- you'd have to support index updates in every
>>>>>>> writer that touches table data. You would have to spend the time
>>>>>>> updating indexes at write time, but there are competing priorities like
>>>>>>> making data available. So I think you want asynchronous index updates
>>>>>>> and that leads to integration with the table format.
>>>>>>> > Index levels
>>>>>>> I think that partition-level indexes are better for job planning
>>>>>>> (eliminate whole partitions!) but file-level are still useful for
>>>>>>> skipping files at the task level. I would probably focus on
>>>>>>> partition-level, but I'm not strongly opinionated here. File-level is
>>>>>>> probably a stepping stone to partition-level, given that we would be
>>>>>>> able to track index data in the same format.
>>>>>>> > Index storage
>>>>>>> Do you mean putting indexes in Parquet, or using Parquet for
>>>>>>> indexes? I think that bloom filters would probably exceed the amount of
>>>>>>> data we'd want to put into a Parquet binary column, probably at the file
>>>>>>> level and almost certainly at the partition level, since the size
>>>>>>> depends on the number of distinct values and the primary use is for
>>>>>>> identifiers.
>>>>>>> > Indexing process
>>>>>>> Synchronous is nice, but as I said above, I think we have to support
>>>>>>> async because it is too complicated to update every writer that touches
>>>>>>> a table and you may not want to pay the price at write time.
>>>>>>> > Index validation
>>>>>>> I think this is pretty much what I talked about for question 1. I
>>>>>>> think that we have a good plan around using sequence numbers, if we
>>>>>>> want to do this.
>>>>>>> Ryan
>>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>> Hi everyone,
>>>>>>> Based on the conversation in the last community sync and the Iceberg
>>>>>>> Slack channel, it seems like multiple parties have interest in
>>>>>>> continuing the effort related to the secondary index in Iceberg, so I
>>>>>>> would like to restart the thread to continue the discussion.
>>>>>>> So far most people refer to the document authored by Miao Wang
>>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>>>> which has a lot of useful information about the design and
>>>>>>> implementation. However, the document is also quite old (over a year
>>>>>>> now) and a lot has changed in Iceberg since then. I think the document
>>>>>>> leaves the following open topics that we need to continue to address:
>>>>>>> 1. *scope of native index support*: what type of index should
>>>>>>> Iceberg support natively, and how should developers allocate effort
>>>>>>> between adding native Iceberg index support and developing Iceberg
>>>>>>> support for holistic indexing projects such as HyperSpace
>>>>>>> <https://microsoft.github.io/hyperspace/>.
>>>>>>> 2. *index levels*: we have talked about partition level indexing
>>>>>>> and file level indexing. More clarity is needed for these index levels
>>>>>>> and the level of interest and support needed for those different
>>>>>>> indexing levels.
>>>>>>> 3. *index storage*: we had unsettled debates around making the index
>>>>>>> a separate file or embedding it as a part of the existing Iceberg file
>>>>>>> structure. We need to come up with certain criteria such as index size,
>>>>>>> ease of generation during write, etc. to settle the discussion.
>>>>>>> 4. *indexing process*: as stated in Miao's document, indexes could
>>>>>>> be created during the data writing process synchronously, or built
>>>>>>> asynchronously through an index service. Discussion is needed on the
>>>>>>> focus of the Iceberg index functionalities.
>>>>>>> 5. *index invalidation*: depending on the scope and level, certain
>>>>>>> indexes need to be invalidated during operations like RewriteFiles.
>>>>>>> Clarity is needed in this domain, including whether we need another
>>>>>>> sequence number to track such invalidation.
>>>>>>> I suggest we iterate a bit on this list of open questions, and then
>>>>>>> we can have a meeting to discuss those aspects, and produce an updated
>>>>>>> document addressing those aspects to provide a clear path forward for
>>>>>>> developers interested in adding features in this domain.
>>>>>>> Any thoughts?
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
