Hi PF,

Sure, I've rescheduled the meeting to a CET-friendly time.
The meeting is now scheduled for 9 AM PST, March 11th (6 PM CET, March 11th).
The meeting link is meet.google.com/ttd-jzid-abp
Please feel free to Slack me or tag me in the Slack channel if you would
like a meeting invitation (or you can join the meeting directly).

Best,
Zaicheng



On Mon, Mar 7, 2022 at 9:54 PM Piotr Findeisen <pi...@starburstdata.com> wrote:

> Hi Zaicheng,
>
> thanks for following up on this. I'm certainly interested.
> The proposed time doesn't work for me though, I'm in the CET time zone.
>
> Best,
> PF
>
>
> On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang <wcatp19891...@gmail.com>
> wrote:
>
>> Hi dev folks,
>>
>> As discussed in the sync
>> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.z3dncl7gr8m1>
>> meeting, we will have a dedicated meeting on this topic.
>> I tentatively scheduled a meeting for 4 PM PST, March 8th. The meeting
>> link is https://meet.google.com/ttd-jzid-abp
>> Please let me know if the time does not work for you.
>>
>> Thanks,
>> Zaicheng
>>
>> On Wed, Mar 2, 2022 at 9:17 PM zaicheng wang <wangzaich...@bytedance.com> wrote:
>>
>>> Hi folks,
>>>
>>> This is Zaicheng from ByteDance. We have spent some time working on the
>>> index invalidation problem discussed in this dev email thread, and while
>>> working on the POC, we realized that some metadata changes might also be
>>> introduced.
>>> We put these details into a document:
>>>
>>> https://docs.google.com/document/d/1hLCKNtnA94gFKjssQpqS_2qqAxrlwqq3f6agij_4Rm4/edit?usp=sharing
>>> The document includes two proposals for solving the index invalidation
>>> problem: one based on @Jack Ye's idea of introducing a new sequence
>>> number, and another that leverages the current manifest entry structure.
>>> The document also describes the corresponding table spec changes.
>>> Please let me know if you have any thoughts. We could also discuss this
>>> during the sync meeting.
>>>
>>> Thanks,
>>> Zaicheng
>>>
>>> On Tue, Feb 1, 2022 at 8:51 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Hi Zaicheng, I cannot see your pictures, maybe we could discuss in
>>>> Slack.
>>>>
>>>> The goal here is to have a monotonically increasing number that can be
>>>> used to detect which files have been newly added and should be indexed.
>>>> This is especially important for knowing how up-to-date an index is for
>>>> each partition.
>>>>
>>>> In a table without compaction, the sequence numbers of files continue
>>>> to increase. If we have indexed all files up to sequence number 3, we know
>>>> that the next indexing process needs to index all files with a sequence
>>>> number greater than 3. But during compaction, files are rewritten with
>>>> the starting sequence number, and by commit time the table's sequence
>>>> number may already have gone much higher. For example, I start compaction
>>>> at seq=3, and while it runs for a few hours, 10 inserts are done to the
>>>> table, so the current sequence number is 13. When I commit the compacted
>>>> data files, those files are essentially written at a sequence number older
>>>> than the latest. This breaks several assumptions: (1) I cannot find new
>>>> data to index just by checking whether the sequence number is higher than
>>>> a certain value, and (2) a reader cannot determine whether an index can be
>>>> used based on the sequence number.
>>>>
>>>> The solution I was describing is to have another watermark that is
>>>> monotonically increasing regardless of compaction. Compaction would still
>>>> commit those files at seq=3, but their new watermark would be 14. We can
>>>> then use this watermark for all index operations.
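A rough sketch of this watermark idea in Python (the `DataFile` fields and the `files_to_index` helper are illustrative inventions for this sketch, not Iceberg's actual API):

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    sequence_number: int  # may be older than the latest after compaction
    watermark: int        # monotonically increasing, even across compaction

def files_to_index(files, indexed_watermark):
    # Selecting by sequence_number would miss compacted files committed
    # at an old sequence number; the watermark catches them.
    return [f for f in files if f.watermark > indexed_watermark]

# Jack's example: compaction starts at seq=3; 10 inserts later the table
# is at seq=13, and the compacted rewrite commits at seq=3 / watermark=14.
files = [
    DataFile("a.parquet", sequence_number=3, watermark=3),    # already indexed
    DataFile("b.parquet", sequence_number=13, watermark=13),  # new insert
    DataFile("c.parquet", sequence_number=3, watermark=14),   # compacted rewrite
]
print([f.path for f in files_to_index(files, indexed_watermark=3)])
```

Filtering on `sequence_number > 3` would skip `c.parquet` entirely; filtering on the watermark picks up both the new insert and the rewritten file.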
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Sat, Jan 29, 2022 at 5:07 AM Zaicheng Wang <wcatp19891...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jack,
>>>>>
>>>>>
>>>>> Thanks for the summary; it helps me a lot.
>>>>>
>>>>> I'm trying to understand point 2, and here are my 2 cents.
>>>>>
>>>>> *a mechanism for tracking file change is needed. Unfortunately
>>>>> sequence numbers cannot be used due to the introduction of compaction that
>>>>> rewrites files into a lower sequence number. Another monotonically
>>>>> increasing watermark for files has to be introduced for index change
>>>>> detection and invalidation.*
>>>>>
>>>>> Please let me know if any of my assumptions are wrong.
>>>>>
>>>>> So the *reason* we can't use sequence numbers as the validity
>>>>> indicator for the index is compaction. Before compaction (taking a very
>>>>> simple example), each data file and index file have a mapping, and
>>>>> tableScan.planTask() can decide whether to use the index purely by
>>>>> comparing sequence numbers (as well as the index spec ID, if we have one).
>>>>>
>>>>> After compaction, tableScan.planTask() can no longer do so, because data
>>>>> file 5 is compacted into a new data file with seq = 10. Thus, wrong plan
>>>>> tasks might be returned.
>>>>>
>>>>> I wonder how an additional watermark used only for the index would
>>>>> solve the problem.
>>>>>
>>>>>
>>>>> Based on my gut feeling, we might be able to solve the problem with the
>>>>> current sequence number:
>>>>>
>>>>> *Option 1*: When compacting, we could compact the data files whose
>>>>> index is up to date into one group, and the files whose index is stale or
>>>>> missing into another group (just like what we do with data files that are
>>>>> unpartitioned or whose partition spec ID does not match).
>>>>>
>>>>> The *pro* is that we could still leverage indexes for some of the data
>>>>> files, and we could reuse the sequence number.
>>>>>
>>>>> The *cons* are that the compaction might not reach the target size and
>>>>> we might still end up with small files.
>>>>>
>>>>> *Option 2*:
>>>>>
>>>>> Assume compaction is usually triggered by data engineers and is not
>>>>> very frequent. We could simply invalidate all index files for the
>>>>> compacted data files, and the user would need to rebuild the index after
>>>>> every compaction.
>>>>>
>>>>> *Pro*: Easy to implement and clear to understand.
>>>>>
>>>>> *Cons*: Relatively poor user experience; wastes some computing
>>>>> resources redoing work.
>>>>>
>>>>> *Option 3*:
>>>>>
>>>>> We could leverage the engine's computing resources to always rebuild
>>>>> indexes during data compaction.
>>>>>
>>>>> *Pro*: Users could leverage the index right after data compaction.
>>>>>
>>>>> *Cons*: Rebuilding might take more time and resources.
>>>>>
>>>>> *Option 3 alternative*: add a configuration property to compaction that
>>>>> controls whether the user wants to rebuild the index during compaction.
>>>>>
>>>>>
>>>>> Please let me know if you have any thoughts on this.
>>>>>
>>>>> Best,
>>>>>
>>>>> Zaicheng
>>>>>
>>>>> On Wed, Jan 26, 2022 at 1:17 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the fast responses!
>>>>>>
>>>>>> Based on the conversations above, it sounds like we have the
>>>>>> following consensus:
>>>>>>
>>>>>> 1. asynchronous index creation is preferred, although synchronous
>>>>>> index creation is possible.
>>>>>> 2. a mechanism for tracking file changes is needed. Unfortunately,
>>>>>> sequence numbers cannot be used, because compaction rewrites files into a
>>>>>> lower sequence number. Another monotonically increasing watermark for
>>>>>> files has to be introduced for index change detection and invalidation.
>>>>>> 3. index creation and maintenance procedures should be pluggable by
>>>>>> different engines. This should not be an issue because Iceberg has been
>>>>>> designing action interfaces for different table maintenance procedures so
>>>>>> far, so what Zaicheng describes should be the natural development 
>>>>>> direction
>>>>>> once the work is started.
>>>>>>
>>>>>> Regarding index levels, I also think partition-level indexes are more
>>>>>> important, but it seems like we have to do file level first as the
>>>>>> foundation. This leads to the index storage part. I am not talking about
>>>>>> using Parquet to store it; I am asking about what Miao is describing. I
>>>>>> don't think we have a consensus around the exact place to store index
>>>>>> information yet. My memory is that there are a few ways:
>>>>>> 1. file-level index stored as a binary field in the manifest,
>>>>>> partition-level index stored as a binary field in the manifest list.
>>>>>> This would only work for small indexes like bitmaps (or bloom filters,
>>>>>> to a certain extent)
>>>>>> 2. some sort of binary file to store the index data, with the index
>>>>>> metadata (e.g. index type) and a pointer to the binary index data file
>>>>>> kept as in 1 (I think this is what Miao is describing)
>>>>>> 3. some sort of index spec to independently store index metadata and
>>>>>> data, similar to what we are proposing today for views
>>>>>>
>>>>>> Another aspect of index storage is the index file location, in cases 2
>>>>>> and 3. In the original doc a specific file path structure is proposed,
>>>>>> whereas this is a bit against the Iceberg standard of not making
>>>>>> assumptions about file paths, so that any storage works. We also need
>>>>>> more clarity on that topic.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <
>>>>>> wcatp19891...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for starting this thread. This is Zaicheng from ByteDance.
>>>>>>>
>>>>>>> Initially we were planning to add an index feature to our internal
>>>>>>> Trino, and we feel Iceberg could be the best place for holding/building
>>>>>>> the index data.
>>>>>>> We are very interested in having and contributing to this feature.
>>>>>>> (I'm pretty new to the community, but here are my 2 cents.)
>>>>>>>
>>>>>>> Echoing what Miao mentioned in 4): I feel Iceberg could provide
>>>>>>> interfaces for creating/updating/deleting indexes, and each engine can
>>>>>>> decide how to invoke these methods (in a distributed or single-threaded
>>>>>>> manner, async or sync).
>>>>>>> Taking our use case as an example, we plan to add new DDL syntax
>>>>>>> "create index id_1 on table col_1 using bloom" / "update index id_1 on
>>>>>>> table col_1", and our SQL engine will create distributed index
>>>>>>> creation/updating operators. Each operator will invoke the index-related
>>>>>>> methods provided by Iceberg.
>>>>>>>
>>>>>>> Storage): Does the index data have to be a file? I wonder if we should
>>>>>>> design the index data storage interface in such a way that people can
>>>>>>> plug in different index storage (file storage, a centralized index
>>>>>>> storage service) later on.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zaicheng
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 26, 2022 at 10:22 AM Miao Wang <miw...@adobe.com.invalid> wrote:
>>>>>>>
>>>>>>>> Thanks, Jack, for resuming the discussion. Zaicheng from ByteDance
>>>>>>>> created a Slack channel for the index work. I suggested he add Anton
>>>>>>>> and you to the channel.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I still remember some conclusions from previous discussions.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1). Index type support: we planned to support skipping indexes
>>>>>>>> first. Iceberg metadata exposes hints about whether the tracked data
>>>>>>>> files have an index, which reduces index reading overhead. The index
>>>>>>>> file can be applied when generating the scan task.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2). As Ryan mentioned, the sequence number will be used to indicate
>>>>>>>> whether an index is valid. The sequence number can link data evolution
>>>>>>>> with index evolution.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3). Storage: we planned to have a simple file format that includes
>>>>>>>> the column name/ID, index type (string), index content length, and the
>>>>>>>> binary content. It is not necessary to use Parquet to store the index.
>>>>>>>> The initial thought was one data file mapping to one index file; this
>>>>>>>> can be merged into one partition mapping to one index file. As Ryan
>>>>>>>> said, a file-level implementation could be a stepping stone toward a
>>>>>>>> partition-level implementation.
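To make that layout concrete, here is a hedged sketch of one possible encoding in Python (little-endian, length-prefixed fields; the exact byte layout is an assumption for illustration, not a proposed spec):

```python
import struct

def write_index_entry(column_id, column_name, index_type, content):
    """Pack one entry: column id, name, index type (string), and binary content."""
    name_b = column_name.encode("utf-8")
    type_b = index_type.encode("utf-8")
    return (struct.pack("<iH", column_id, len(name_b)) + name_b
            + struct.pack("<H", len(type_b)) + type_b
            + struct.pack("<I", len(content)) + content)

def read_index_entry(buf):
    """Unpack an entry written by write_index_entry."""
    column_id, name_len = struct.unpack_from("<iH", buf, 0)
    off = 6
    name = buf[off:off + name_len].decode("utf-8"); off += name_len
    (type_len,) = struct.unpack_from("<H", buf, off); off += 2
    index_type = buf[off:off + type_len].decode("utf-8"); off += type_len
    (content_len,) = struct.unpack_from("<I", buf, off); off += 4
    content = buf[off:off + content_len]
    return column_id, name, index_type, content

# Round trip: a bloom-filter payload (bytes are placeholders) for column "user_id".
entry = write_index_entry(1, "user_id", "bloom", b"\x01\x02\x03")
print(read_index_entry(entry))
```

The length-prefixed content field is what lets readers skip index types they do not understand without parsing the binary payload.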
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4). How to build the index: we want to keep the index reading and
>>>>>>>> writing interfaces in Iceberg and leave the actual build logic
>>>>>>>> engine-specific (i.e., we can use different compute to build the index
>>>>>>>> without changing anything inside Iceberg).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Misc:
>>>>>>>>
>>>>>>>> Huaxin implemented the index support API for DSv2 in the Spark 3.x
>>>>>>>> code base.
>>>>>>>>
>>>>>>>> Design doc:
>>>>>>>> https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>>>>>>>
>>>>>>>> PR should have been merged.
>>>>>>>>
>>>>>>>> Guy from IBM did a partial PoC and provided a private doc. I will
>>>>>>>> ask if he can make it public.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> We can continue the discussion and break the big tasks down into
>>>>>>>> tickets.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Miao
>>>>>>>>
>>>>>>>> *From: *Ryan Blue <b...@tabular.io>
>>>>>>>> *Date: *Tuesday, January 25, 2022 at 5:08 PM
>>>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>>>>> *Subject: *Re: Continuing the Secondary Index Discussion
>>>>>>>>
>>>>>>>> Thanks for raising this for discussion, Jack! It would be great to
>>>>>>>> start adding more indexes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Scope of native index support
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The way I think about it, the biggest challenge here is knowing when
>>>>>>>> you can use an index. For example, if you have a partition index that
>>>>>>>> is up to date as of snapshot 13764091836784, but the current snapshot
>>>>>>>> is 97613097151667, then you basically have no idea which files are
>>>>>>>> covered and can't use it. On the other hand, if you know that the
>>>>>>>> index was up to date as of sequence number 11 and you're reading at
>>>>>>>> sequence number 12, then you just have to additionally read any data
>>>>>>>> file that was written at sequence number 12.
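Ryan's check can be sketched as follows (illustrative Python; `plan_with_index` is a made-up helper, not Iceberg's planner):

```python
def plan_with_index(files, index_sequence_number):
    """files: list of (path, sequence_number) pairs.

    Files at or below the index's sequence-number watermark are covered by
    the index; anything newer must be read in full.
    """
    covered = [p for p, seq in files if seq <= index_sequence_number]
    scan_fully = [p for p, seq in files if seq > index_sequence_number]
    return covered, scan_fully

# Index up to date as of sequence number 11, reading at sequence number 12:
files = [("a.parquet", 10), ("b.parquet", 11), ("c.parquet", 12)]
covered, scan_fully = plan_with_index(files, index_sequence_number=11)
print(covered)     # the index can prune these
print(scan_fully)  # written after the index was built; read regardless
```

With snapshot IDs instead of sequence numbers, no such ordering comparison is possible, which is Ryan's point about why sequence numbers are the right watermark.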
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The problem of knowing where you can use an index makes me think it
>>>>>>>> is best to maintain index metadata within Iceberg. An alternative is
>>>>>>>> to try to always keep the index up to date, but I don't think that's
>>>>>>>> necessarily possible -- you'd have to support index updates in every
>>>>>>>> writer that touches table data. You would also have to spend time
>>>>>>>> updating indexes at write time, while there are competing priorities
>>>>>>>> like making data available. So I think you want asynchronous index
>>>>>>>> updates, and that leads to integration with the table format.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Index levels
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think that partition-level indexes are better for job planning
>>>>>>>> (eliminate whole partitions!), but file-level indexes are still useful
>>>>>>>> for skipping files at the task level. I would probably focus on
>>>>>>>> partition level, but I'm not strongly opinionated here. File level is
>>>>>>>> probably a stepping stone to partition level, given that we would be
>>>>>>>> able to track index data in the same format.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Index storage
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Do you mean putting indexes in Parquet, or using Parquet for
>>>>>>>> indexes? I think that bloom filters would probably exceed the amount of
>>>>>>>> data we'd want to put into a Parquet binary column, probably at the 
>>>>>>>> file
>>>>>>>> level and almost certainly at the partition level, since the size 
>>>>>>>> depends
>>>>>>>> on the number of distinct values and the primary use is for 
>>>>>>>> identifiers.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Indexing process
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Synchronous is nice, but as I said above, I think we have to
>>>>>>>> support async because it is too complicated to update every writer that
>>>>>>>> touches a table and you may not want to pay the price at write time.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > Index validation
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think this is pretty much what I talked about for question 1. I
>>>>>>>> think that we have a good plan around using sequence numbers, if we 
>>>>>>>> want to
>>>>>>>> do this.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Based on the conversation in the last community sync and the
>>>>>>>> Iceberg Slack channel, it seems like multiple parties have interest in
>>>>>>>> continuing the effort related to the secondary index in Iceberg, so I 
>>>>>>>> would
>>>>>>>> like to restart the thread to continue the discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So far most people refer to the document authored by Miao Wang
>>>>>>>> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
>>>>>>>> which has a lot of useful information about the design and 
>>>>>>>> implementation.
>>>>>>>> However, the document is also quite old (over a year now) and a lot has
>>>>>>>> changed in Iceberg since then. I think the document leaves the 
>>>>>>>> following
>>>>>>>> open topics that we need to continue to address:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. *scope of native index support*: what types of index should
>>>>>>>> Iceberg support natively, and how should developers allocate effort
>>>>>>>> between adding support for Iceberg-native indexes versus developing
>>>>>>>> Iceberg support for holistic indexing projects such as HyperSpace
>>>>>>>> <https://microsoft.github.io/hyperspace/>.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. *index levels*: we have talked about partition-level indexing and
>>>>>>>> file-level indexing. More clarity is needed on these index levels and
>>>>>>>> on the level of interest and support needed for each.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3. *index storage*: we had unsettled debates around storing indexes
>>>>>>>> as separate files versus embedding them as part of the existing
>>>>>>>> Iceberg file structure. We need to come up with criteria such as index
>>>>>>>> size, ease of generation during writes, etc. to settle the discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4. *Indexing process*: as stated in Miao's document, indexes could be
>>>>>>>> created synchronously during the data writing process, or built
>>>>>>>> asynchronously through an index service. Discussion is needed on the
>>>>>>>> focus of the Iceberg index functionality.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 5. *index invalidation*: depending on the scope and level, certain
>>>>>>>> indexes need to be invalidated during operations like RewriteFiles.
>>>>>>>> Clarity is needed in this domain, including whether we need another
>>>>>>>> sequence number to track such invalidation.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I suggest we iterate a bit on this list of open questions, then have
>>>>>>>> a meeting to discuss them and produce an updated document that
>>>>>>>> provides a clear path forward for developers interested in adding
>>>>>>>> features in this domain.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Ryan Blue
>>>>>>>>
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
