Re: Continuing the Secondary Index Discussion

Ryan Blue Tue, 25 Jan 2022 17:07:53 -0800

Thanks for raising this for discussion, Jack! It would be great to start
adding more indexes.

> Scope of native index support

The way I think about it, the biggest challenge here is how to know when
you can use an index. For example, if you have a partition index that is up
to date as of snapshot 13764091836784, but the current snapshot is
97613097151667, then you basically have no idea what files are covered or
not and can't use it. On the other hand, if you know that the index was up
to date as of sequence number 11 and you're reading sequence number 12,
then you just have to read any data file that was written at sequence
number 12.
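
To make the sequence number case concrete, here is a minimal sketch of the planning logic. The names (`DataFile`, `files_to_scan`, `index_hits`) are hypothetical and not part of any Iceberg API, and it assumes that only appends happened after the index was built, with no rewrites or deletes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int  # sequence number at which the file was added


def files_to_scan(live_files, index_seq, read_seq, index_hits):
    """Pick the files a reader must open.

    live_files: files live in the snapshot being read
    index_seq:  sequence number the index was up to date as of
    read_seq:   sequence number being read
    index_hits: paths the index reports as possibly matching
    Assumption: only appends happened after index_seq.
    """
    selected = []
    for f in live_files:
        if f.sequence_number > read_seq:
            continue  # written after the version being read
        if f.sequence_number > index_seq:
            selected.append(f.path)  # not covered by the index, must read
        elif f.path in index_hits:
            selected.append(f.path)  # covered, and the index says "maybe"
    return selected


files = [DataFile("a", 10), DataFile("b", 11), DataFile("c", 12)]
print(files_to_scan(files, index_seq=11, read_seq=12, index_hits={"a"}))
```

With bare snapshot IDs there is no ordering to compare against, so the `f.sequence_number > index_seq` test has no equivalent.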

The problem of when you can use an index makes me think that it is best to
maintain index metadata within Iceberg. An alternative is to try to always
keep the index up-to-date, but I don't think that's necessarily possible --
you'd have to support index updates in every writer that touches table
data. You would have to spend the time updating indexes at write time, but
there are competing priorities like making data available. So I think you
want asynchronous index updates and that leads to integration with the
table format.

> Index levels

I think that partition-level indexes are better for job planning (eliminate
whole partitions!) but file-level are still useful for skipping files at
the task level. I would probably focus on partition-level, but I'm not
strongly opinionated here. File-level is probably a stepping stone to
partition-level, given that we would be able to track index data in the
same format.
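
As a rough illustration of how the two levels could work together, here is a sketch in which plain Python sets stand in for the real index structures. The names `plan_scan`, `partition_index`, and `file_index` are made up for the example, and a missing index entry is treated as "might match":

```python
def plan_scan(partitions, partition_index, file_index, key):
    """Return the file paths that may contain `key`.

    partitions:      dict of partition -> list of file paths
    partition_index: dict of partition -> set of keys (stand-in index)
    file_index:      dict of file path -> set of keys (stand-in index)
    """
    selected = []
    for partition, paths in partitions.items():
        part_keys = partition_index.get(partition)
        if part_keys is not None and key not in part_keys:
            continue  # job planning: eliminate the whole partition
        for path in paths:
            file_keys = file_index.get(path)
            if file_keys is not None and key not in file_keys:
                continue  # task level: skip this file
            selected.append(path)
    return selected


partitions = {"p1": ["a", "b"], "p2": ["c"]}
partition_index = {"p1": {1, 2}, "p2": {3}}
file_index = {"a": {1}, "b": {2}}  # no entry for "c"
print(plan_scan(partitions, partition_index, file_index, 1))
```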

> Index storage

Do you mean putting indexes in Parquet, or using Parquet for indexes? I
think that bloom filters would probably exceed the amount of data we'd want
to put into a Parquet binary column, probably at the file level and almost
certainly at the partition level, since the size depends on the number of
distinct values and the primary use is for identifiers.
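
For a sense of scale, here is a back-of-the-envelope sizing using the textbook formula for a classic bloom filter, m = -n ln(p) / (ln 2)^2 bits. This is only an estimate: the function name is mine, and a real implementation (for example a block-based filter) would size differently:

```python
import math


def bloom_filter_bytes(distinct_values, false_positive_rate):
    """Approximate size in bytes of an optimally sized classic bloom filter."""
    bits = -distinct_values * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


# About 9.6 bits per distinct value at a 1% false positive rate,
# so one million distinct identifiers need roughly 1.2 MB.
print(bloom_filter_bytes(1_000_000, 0.01))
```

The size grows linearly with the number of distinct values, so an identifier column with close to one distinct value per row gives a filter that scales with the row count of the file or partition.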

> Indexing process

Synchronous is nice, but as I said above, I think we have to support async
because it is too complicated to update every writer that touches a table
and you may not want to pay the price at write time.

> Index invalidation

I think this is pretty much what I talked about for question 1. I think
that we have a good plan around using sequence numbers, if we want to do
this.

Ryan

On Tue, Jan 25, 2022 at 3:23 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> Based on the conversation in the last community sync and the Iceberg Slack
> channel, it seems like multiple parties have interest in continuing the
> effort related to the secondary index in Iceberg, so I would like to
> restart the thread to continue the discussion.
>
> So far most people refer to the document authored by Miao Wang
> <https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
> which has a lot of useful information about the design and implementation.
> However, the document is also quite old (over a year now) and a lot has
> changed in Iceberg since then. I think the document leaves the following
> open topics that we need to continue to address:
>
> 1. *scope of native index support*: what type of index should Iceberg
> support natively, and how should developers allocate effort between adding
> native Iceberg index support and developing Iceberg support for
> holistic indexing projects such as HyperSpace
> <https://microsoft.github.io/hyperspace/>?
>
> 2. *index levels*: we have talked about partition level indexing and file
> level indexing. More clarity is needed for these index levels and the level
> of interest and support needed for those different indexing levels.
>
> 3. *index storage*: we had unsettled debates around making indexes
> separate files or embedding them as a part of the existing Iceberg file
> structure. We need to come up with criteria such as index size,
> ease of generation during write, etc. to settle the discussion.
>
> 4. *indexing process*: as stated in Miao's document, indexes could be
> created during the data writing process synchronously, or built
> asynchronously through an index service. We need to discuss which of
> these the Iceberg index functionality should focus on.
>
> 5. *index invalidation*: depending on the scope and level, certain indexes
> need to be invalidated during operations like RewriteFiles. Clarity is
> needed in this domain, including whether we need another sequence number to
> track such invalidation.
>
> I suggest we iterate a bit on this list of open questions, and then we can
> have a meeting to discuss those aspects, and produce an updated document
> addressing those aspects to provide a clear path forward for developers
> interested in adding features in this domain.
>
> Any thoughts?
>
> Best,
> Jack Ye
>
>

-- 
Ryan Blue
Tabular
