[Data skipping Index to improve query performance] https://github.com/apache/hudi/blob/920f45926a3112b6d045ca3b434bc7c4e55e5e3c/rfc/rfc-27/rfc-27.md
Jack Ye <yezhao...@gmail.com> wrote on Wed, Jan 26, 2022 at 13:17:

> Thanks for the fast responses!
>
> Based on the conversations above, it sounds like we have the following consensus:
>
> 1. Asynchronous index creation is preferred, although synchronous index creation is possible.
> 2. A mechanism for tracking file changes is needed. Unfortunately, the sequence number cannot be used, because compaction rewrites files into a lower sequence number. Another monotonically increasing watermark for files has to be introduced for index change detection and invalidation.
> 3. Index creation and maintenance procedures should be pluggable by different engines. This should not be an issue, because Iceberg has been designing action interfaces for different table maintenance procedures so far, so what Zaicheng describes should be the natural development direction once the work is started.
>
> Regarding index level, I also think a partition-level index is more important, but it seems like we have to do file level first as the foundation.
>
> This leads to the index storage part. I am not talking about using Parquet to store it; I am asking about what Miao is describing. I don't think we have a consensus yet around the exact place to store index information. My memory is that there are 3 ways:
> 1. File-level index stored as a binary field in the manifest, partition-level index stored as a binary field in the manifest list. This would only work for small indexes like bitmaps (or bloom filters, to a certain extent).
> 2. Some sort of binary file to store the index data, with the index metadata (e.g. index type) and a pointer to the binary index data file kept as in 1 (I think this is what Miao is describing).
> 3. Some sort of index spec to independently store index metadata and data, similar to what we are proposing today for views.
>
> Another aspect of index storage is the index file location in case of 2 and 3.
> In the original doc a specific file path structure is proposed, whereas this is a bit against the Iceberg standard of not assuming file paths, so that any storage can be used. We also need more clarity on that topic.
>
> Best,
> Jack Ye
>
> On Tue, Jan 25, 2022 at 7:02 PM Zaicheng Wang <wcatp19891...@gmail.com> wrote:
>
>> Thanks for starting the thread. This is Zaicheng from ByteDance.
>>
>> Initially we were planning to add an index feature for our internal Trino, and we feel like Iceberg could be the best place for holding/building the index data. We are very interested in having and contributing to this feature. (Pretty new to the community, just adding my 2 cents.)
>>
>> Echoing what Miao mentioned in 4): I feel Iceberg could provide interfaces for creating/updating/deleting an index, and each engine can decide how to invoke these methods (in a distributed or single-threaded manner, async or sync). Take our use case as an example: we plan to add new DDL syntax "create index id_1 on table col_1 using bloom" / "update index id_1 on table col_1", and our SQL engine will create distributed index creation/updating operators. Each operator will invoke the index-related methods provided by Iceberg.
>>
>> Storage): Does the index data have to be a file? Wondering if we want to design the index data storage interface in such a way that people can plug in different index storage (file storage / centralized index storage service) later on.
>>
>> Thanks,
>> Zaicheng
>>
>> Miao Wang <miw...@adobe.com.invalid> wrote on Wed, Jan 26, 2022 at 10:22:
>>
>>> Thanks Jack for resuming the discussion. Zaicheng from ByteDance created a Slack channel for index work. I suggested that he add Anton and you to the channel.
>>>
>>> I still remember some conclusions from previous discussions.
>>>
>>> 1). Index types to support: We planned to support skipping indexes first.
>>> Iceberg metadata exposes hints about whether the tracked data files have an index, which reduces index reading overhead. The index file can be applied when generating the scan task.
>>>
>>> 2). As Ryan mentioned, the sequence number will be used to indicate whether an index is valid. The sequence number can link data evolution with index evolution.
>>>
>>> 3). Storage: We planned to have a simple file format which includes column name/ID, index type (string), index content length, and binary content. It is not necessary to use Parquet to store the index. The initial thought was 1 data file mapping to 1 index file. It can be merged to 1 partition mapping to 1 index file. As Ryan said, a file-level implementation could be a stepping stone for a partition-level implementation.
>>>
>>> 4). How to build the index: We want to keep the index reading and writing interfaces within Iceberg and leave the actual building logic as engine specific (i.e., we can use different compute to build the index without changing anything inside Iceberg).
>>>
>>> Misc:
>>> Huaxin implemented an index support API for DSv2 in the Spark 3.x code base.
>>> Design doc: https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
>>> The PR should have been merged.
>>> Guy from IBM did a partial PoC and provided a private doc. I will ask if he can make it public.
>>>
>>> We can continue the discussion and break down the big tasks into tickets.
>>>
>>> Thanks!
>>>
>>> Miao
>>>
>>> *From:* Ryan Blue <b...@tabular.io>
>>> *Date:* Tuesday, January 25, 2022 at 5:08 PM
>>> *To:* Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject:* Re: Continuing the Secondary Index Discussion
>>>
>>> Thanks for raising this for discussion, Jack! It would be great to start adding more indexes.
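Miao's point 3) above describes a deliberately simple index file layout: column name/ID, index type (string), index content length, and the binary content itself. As a rough sketch only (the field widths, byte order, and function names below are assumptions for illustration, not anything Iceberg specifies), such a record could be framed like this:

```python
import struct

def write_index_blob(column_id: int, index_type: str, content: bytes) -> bytes:
    """Frame one index entry: column id, index type, content length, payload.

    Hypothetical layout: 4-byte column id, 2-byte type-name length, the
    UTF-8 type name, an 8-byte content length, then the raw index bytes.
    """
    type_bytes = index_type.encode("utf-8")
    return (
        struct.pack(">IH", column_id, len(type_bytes))
        + type_bytes
        + struct.pack(">Q", len(content))
        + content
    )

def read_index_blob(buf: bytes):
    """Inverse of write_index_blob: recover (column_id, index_type, content)."""
    column_id, type_len = struct.unpack_from(">IH", buf, 0)
    offset = 6 + type_len
    index_type = buf[6:offset].decode("utf-8")
    (content_len,) = struct.unpack_from(">Q", buf, offset)
    offset += 8
    return column_id, index_type, buf[offset:offset + content_len]
```

The explicit content-length prefix is what would let a reader skip over an index type it does not understand, which matters if index building stays engine-pluggable as point 4) suggests.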
>>> > Scope of native index support
>>>
>>> The way I think about it, the biggest challenge here is how to know when you can use an index. For example, if you have a partition index that is up to date as of snapshot 13764091836784, but the current snapshot is 97613097151667, then you basically have no idea which files are covered and can't use it. On the other hand, if you know that the index was up to date as of sequence number 11 and you're reading sequence number 12, then you just have to read any data file that was written at sequence number 12.
>>>
>>> The problem of knowing where you can use an index makes me think that it is best to maintain index metadata within Iceberg. An alternative is to try to always keep the index up to date, but I don't think that's necessarily possible -- you'd have to support index updates in every writer that touches table data. You would have to spend the time updating indexes at write time, but there are competing priorities like making data available. So I think you want asynchronous index updates, and that leads to integration with the table format.
>>>
>>> > Index levels
>>>
>>> I think that partition-level indexes are better for job planning (eliminate whole partitions!), but file-level indexes are still useful for skipping files at the task level. I would probably focus on partition-level, but I'm not strongly opinionated here. File-level is probably a stepping stone to partition-level, given that we would be able to track index data in the same format.
>>>
>>> > Index storage
>>>
>>> Do you mean putting indexes in Parquet, or using Parquet for indexes?
>>> I think that bloom filters would probably exceed the amount of data we'd want to put into a Parquet binary column, probably at the file level and almost certainly at the partition level, since the size depends on the number of distinct values and the primary use is for identifiers.
>>>
>>> > Indexing process
>>>
>>> Synchronous is nice, but as I said above, I think we have to support async, because it is too complicated to update every writer that touches a table and you may not want to pay the price at write time.
>>>
>>> > Index validation
>>>
>>> I think this is pretty much what I talked about for question 1. I think that we have a good plan around using sequence numbers, if we want to do this.
>>>
>>> Ryan
>>>
>>> On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> Based on the conversation in the last community sync and the Iceberg Slack channel, it seems like multiple parties have an interest in continuing the effort related to the secondary index in Iceberg, so I would like to restart the thread to continue the discussion.
>>>
>>> So far most people refer to the document authored by Miao Wang <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>, which has a lot of useful information about the design and implementation. However, the document is also quite old (over a year now) and a lot has changed in Iceberg since then.
>>> I think the document leaves the following open topics that we need to continue to address:
>>>
>>> 1. *Scope of native index support*: what types of index should Iceberg support natively, and how should developers allocate effort between adding Iceberg-native index support versus developing Iceberg support for holistic indexing projects such as Hyperspace <https://microsoft.github.io/hyperspace/>.
>>>
>>> 2. *Index levels*: we have talked about partition-level indexing and file-level indexing. More clarity is needed on these index levels and on the level of interest and support needed for each.
>>>
>>> 3. *Index storage*: we had unsettled debates around making indexes separate files or embedding them as part of the existing Iceberg file structure. We need to come up with criteria, such as index size and ease of generation during writes, to settle the discussion.
>>>
>>> 4. *Indexing process*: as stated in Miao's document, indexes could be created synchronously during the data writing process, or built asynchronously through an index service. Discussion is needed on the focus of the Iceberg index functionality.
>>>
>>> 5. *Index invalidation*: depending on the scope and level, certain indexes need to be invalidated during operations like RewriteFiles. Clarity is needed in this domain, including whether we need another sequence number to track such invalidation.
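Ryan's sequence-number argument earlier in the thread (an index up to date as of sequence number 11 still covers everything except data files written at sequence number 12) amounts to a simple planning rule. A minimal sketch, with illustrative names only, not Iceberg API:

```python
def plan_scan(data_files, index_sequence_number):
    """Split data files by whether a secondary index can prune them.

    A file whose sequence number is greater than the index's last-indexed
    sequence number postdates the index, so it must be read regardless of
    what the index says; all other files are eligible for index-based
    skipping. data_files is a list of (path, sequence_number) pairs.
    """
    covered, uncovered = [], []
    for path, sequence_number in data_files:
        if sequence_number <= index_sequence_number:
            covered.append(path)
        else:
            uncovered.append(path)
    return covered, uncovered
```

With the thread's example, an index at sequence number 11 leaves only the sequence-number-12 file to be read unconditionally. This is also exactly the property Jack's consensus point 2 says compaction breaks: a rewritten file can receive a lower sequence number than the index watermark, which is why a separate monotonically increasing watermark was proposed for invalidation.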
>>> I suggest we iterate a bit on this list of open questions; then we can have a meeting to discuss these aspects and produce an updated document that addresses them, providing a clear path forward for developers interested in adding features in this domain.
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> Jack Ye
>>>
>>> --
>>> Ryan Blue
>>> Tabular
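The bloom filter that recurs throughout the thread (Zaicheng's "create index ... using bloom" DDL, Ryan's sizing concern) works as a skipping index because it can return false positives but never false negatives, so pruning a file on a negative lookup is always safe. A toy illustration of that property, assuming nothing about any engine's actual implementation (all names here are made up):

```python
import hashlib

class ToyBloom:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored as one big integer

    def _positions(self, value):
        # Derive num_hashes independent bit positions from a seeded hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

def files_to_scan(filters_by_file, lookup_value):
    """Keep only the files whose bloom filter might contain the value."""
    return [
        path
        for path, bloom in filters_by_file.items()
        if bloom.might_contain(lookup_value)
    ]
```

Ryan's sizing point follows from the same structure: the bit array must grow with the number of distinct values to keep the false-positive rate down, which is why a partition-level filter over identifiers can get too large to inline in a manifest binary column.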