Re: Continuing the Secondary Index Discussion

Miao Wang Tue, 25 Jan 2022 18:22:03 -0800

Thanks Jack for resuming the discussion. Zaicheng from ByteDance created a 
Slack channel for index work. I suggested that he add Anton and you to the 
channel.

I still remember some conclusions from previous discussions.

1). Index type support: We planned to support skipping indexes first. Iceberg 
metadata exposes hints about whether the tracked data files have an index, 
which reduces index-reading overhead. The index file can be applied when 
generating scan tasks.

2). As Ryan mentioned, sequence numbers will be used to indicate whether an 
index is valid. Sequence numbers can link data evolution with index 
evolution.

3). Storage: We planned to have a simple file format that includes Column 
Name/ID, Index Type (String), index content length, and binary content. It is 
not necessary to use Parquet to store indexes. The initial thought was 1 data 
file mapping to 1 index file. These can later be merged so that 1 partition 
maps to 1 index file. As Ryan said, a file-level implementation could be a 
stepping stone to a partition-level implementation.

4). How to build indexes: We want to keep the index reading and writing 
interface in Iceberg and leave the actual building logic engine-specific (i.e., 
we can use different compute engines to build indexes without changing anything 
inside Iceberg).
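
A minimal sketch of the record layout described in item 3, assuming one record 
per indexed column. The field widths, byte order, and the names IndexEntry, 
encode_entry, and decode_entry are hypothetical; nothing here is an Iceberg 
specification.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout (an assumption for illustration only):
#   column id      : 4-byte signed int, little-endian
#   index type     : 2-byte length prefix + UTF-8 string
#   content length : 8-byte unsigned int
#   content        : raw bytes
_HEADER = "<iH"


@dataclass
class IndexEntry:
    column_id: int
    index_type: str
    content: bytes


def encode_entry(entry: IndexEntry) -> bytes:
    """Serialize one index record into the hypothetical layout above."""
    type_bytes = entry.index_type.encode("utf-8")
    header = struct.pack(_HEADER, entry.column_id, len(type_bytes))
    length = struct.pack("<Q", len(entry.content))
    return header + type_bytes + length + entry.content


def decode_entry(buf: bytes, offset: int = 0):
    """Read one record starting at offset; return (entry, next_offset)."""
    column_id, type_len = struct.unpack_from(_HEADER, buf, offset)
    offset += struct.calcsize(_HEADER)
    index_type = buf[offset:offset + type_len].decode("utf-8")
    offset += type_len
    (content_len,) = struct.unpack_from("<Q", buf, offset)
    offset += 8
    content = bytes(buf[offset:offset + content_len])
    return IndexEntry(column_id, index_type, content), offset + content_len
```

Because every record carries its own content length, records can be 
concatenated, so the same layout could hold one file's indexes or a whole 
partition's.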

Misc:
Huaxin implemented an index support API for DSv2 in the Spark 3.x code base.
Design doc: 
https://docs.google.com/document/d/1qnq1X08Zb4NjCm4Nl_XYjAofwUgXUB03WDLM61B3a_8/edit
The PR should have been merged.
Guy from IBM did a partial PoC and provided a private doc. I will ask if he can 
make it public.

We can continue the discussion and break down the big tasks into tickets.

Thanks!

Miao

From: Ryan Blue <b...@tabular.io>
Date: Tuesday, January 25, 2022 at 5:08 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Continuing the Secondary Index Discussion

Thanks for raising this for discussion, Jack! It would be great to start adding 
more indexes.

> Scope of native index support

The way I think about it, the biggest challenge here is how to know when you 
can use an index. For example, if you have a partition index that is up to date 
as of snapshot 13764091836784, but the current snapshot is 97613097151667, then 
you basically have no idea what files are covered or not and can't use it. On 
the other hand, if you know that the index was up to date as of sequence number 
11 and you're reading sequence number 12, then you just have to read any data 
file that was written at sequence number 12.
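
The sequence-number case can be sketched as follows. This is only an 
illustration: DataFile and split_by_index_coverage are hypothetical names, not 
Iceberg APIs, and the sketch ignores files that were deleted or rewritten after 
the index was built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int  # sequence number at which the file was added


def split_by_index_coverage(files, index_sequence_number):
    """Split files into those the index covers and those that must be read.

    A file is treated as covered when it was written at or before the
    sequence number the index was built for (an assumption of this sketch).
    """
    covered = [f for f in files if f.sequence_number <= index_sequence_number]
    uncovered = [f for f in files if f.sequence_number > index_sequence_number]
    return covered, uncovered


files = [
    DataFile("a.parquet", 10),
    DataFile("b.parquet", 11),
    DataFile("c.parquet", 12),
]
# Index built as of sequence number 11, table read at sequence number 12.
covered, uncovered = split_by_index_coverage(files, index_sequence_number=11)
```

Here the index can prune among a.parquet and b.parquet, while c.parquet has to 
be read directly because the index knows nothing about it.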

The problem of where you can use an index makes me think that it is best to 
maintain index metadata within Iceberg. An alternative is to try to always keep 
the index up-to-date, but I don't think that's necessarily possible -- you'd 
have to support index updates in every writer that touches table data. You 
would have to spend the time updating indexes at write time, but there are 
competing priorities like making data available. So I think you want 
asynchronous index updates and that leads to integration with the table format.

> Index levels

I think that partition-level indexes are better for job planning (eliminate 
whole partitions!) but file-level are still useful for skipping files at the 
task level. I would probably focus on partition-level, but I'm not strongly 
opinionated here. File-level is probably a stepping stone to partition-level, 
given that we would be able to track index data in the same format.
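
As an illustration of the two levels, here is a minimal sketch that uses 
min/max ranges as a stand-in skipping index. All names (FileStats, 
PartitionStats, plan_files) are hypothetical, and for brevity both levels are 
applied in one function rather than split between job planning and task 
execution.

```python
from dataclasses import dataclass, field


@dataclass
class FileStats:
    path: str
    min_value: int
    max_value: int


@dataclass
class PartitionStats:
    name: str
    min_value: int
    max_value: int
    files: list = field(default_factory=list)


def may_contain(min_value, max_value, value):
    """A min/max range can only rule a value out, never confirm it."""
    return min_value <= value <= max_value


def plan_files(partitions, value):
    """Return paths of files that might contain value."""
    selected = []
    for partition in partitions:
        # Partition level: one check eliminates every file in the partition.
        if not may_contain(partition.min_value, partition.max_value, value):
            continue
        # File level: skip individual files inside a surviving partition.
        for f in partition.files:
            if may_contain(f.min_value, f.max_value, value):
                selected.append(f.path)
    return selected
```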

> Index storage

Do you mean putting indexes in Parquet, or using Parquet for indexes? I think 
that bloom filters would probably exceed the amount of data we'd want to put 
into a Parquet binary column, probably at the file level and almost certainly 
at the partition level, since the size depends on the number of distinct values 
and the primary use is for identifiers.
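
To make the size dependence concrete, here is a rough estimate using the 
textbook sizing formula for a classic Bloom filter, m = -n * ln(p) / (ln 2)^2 
bits for n distinct values and false-positive rate p. The distinct-value counts 
and the 1% rate below are made-up inputs, and other Bloom filter variants may 
size differently.

```python
import math


def bloom_filter_bits(distinct_values: int, false_positive_rate: float) -> int:
    """Bit count of an optimally sized classic Bloom filter."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    bits = -distinct_values * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits)


# Hypothetical distinct-value counts for an identifier column.
for label, n in [("file-level", 1_000_000), ("partition-level", 100_000_000)]:
    size_mib = bloom_filter_bits(n, 0.01) / 8 / (1024 * 1024)
    print(f"{label}: {n} distinct values -> {size_mib:.1f} MiB")
```

The size grows linearly with the number of distinct values, which is why an 
identifier column quickly becomes expensive at the partition level.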

> Indexing process

Synchronous is nice, but as I said above, I think we have to support async 
because it is too complicated to update every writer that touches a table and 
you may not want to pay the price at write time.

> Index validation

I think this is pretty much what I talked about for question 1. I think that we 
have a good plan around using sequence numbers, if we want to do this.

Ryan

On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:

Hi everyone,

Based on the conversation in the last community sync and the Iceberg Slack 
channel, it seems like multiple parties have interest in continuing the effort 
related to the secondary index in Iceberg, so I would like to restart the 
thread to continue the discussion.

So far most people refer to the document authored by Miao Wang 
<https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>, 
which has a lot of useful information about the design and implementation. 
However, the document is also quite old (over a year now) and a lot has changed 
in Iceberg since then. I think the document leaves the following open topics 
that we need to continue to address:

1. scope of native index support: what types of index should Iceberg support 
natively, and how should developers allocate effort between adding 
Iceberg-native indexes and developing Iceberg support for holistic indexing 
projects such as HyperSpace <https://microsoft.github.io/hyperspace/>?

2. index levels: we have talked about partition-level indexing and file-level 
indexing. More clarity is needed on these index levels and on the level of 
interest in, and support needed for, each of them.

3. index storage: we had unsettled debates about storing indexes as separate 
files or embedding them in the existing Iceberg file structure. We need to come 
up with criteria such as index size, ease of generation during writes, etc. to 
settle the discussion.

4. indexing process: as stated in Miao's document, indexes could be created 
synchronously during the data writing process, or built asynchronously through 
an index service. Discussion is needed on which of these the Iceberg index 
functionality should focus on.

5. index invalidation: depending on the scope and level, certain indexes need 
to be invalidated during operations like RewriteFiles. Clarity is needed in 
this domain, including whether we need another sequence number to track such 
invalidation.

I suggest we iterate a bit on this list of open questions, and then we can have 
a meeting to discuss those aspects, and produce an updated document addressing 
those aspects to provide a clear path forward for developers interested in 
adding features in this domain.

Any thoughts?

Best,
Jack Ye

--
Ryan Blue
Tabular
