Hi everyone,

Based on the conversation in the last community sync and the Iceberg Slack
channel, it seems multiple parties are interested in continuing the work
on secondary indexes in Iceberg, so I would like to restart the thread to
continue the discussion.

So far most people refer to the document authored by Miao Wang
<https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit>
which has a lot of useful information about the design and implementation.
However, the document is also quite old (over a year now) and a lot has
changed in Iceberg since then. I think the document leaves the following
open topics that we need to continue to address:

1. *scope of native index support*: what types of index should Iceberg
support natively, and how should developers divide effort between building
Iceberg-native indexes and adding Iceberg support to holistic indexing
projects such as HyperSpace
<https://microsoft.github.io/hyperspace/>?

2. *index levels*: we have talked about partition-level indexing and
file-level indexing. We need a clearer definition of each level, and a
better sense of how much interest there is in each one and what support it
needs.

3. *index storage*: we had unsettled debates about whether to store indexes
in separate files or embed them in the existing Iceberg file structure. We
need to agree on criteria such as index size, ease of generation during
writes, etc. to settle the discussion.

4. *indexing process*: as stated in Miao's document, indexes could be
created synchronously during the data writing process, or built
asynchronously through an index service. We need to discuss which of these
the Iceberg index functionality should focus on.

5. *index invalidation*: depending on the scope and level, certain indexes
need to be invalidated during operations like RewriteFiles. We need clarity
here, including whether we need another sequence number to track such
invalidation.

I suggest we iterate a bit on this list of open questions, then have a
meeting to discuss them and produce an updated document that addresses
them and gives a clear path forward for developers interested in adding
features in this domain.

Any thoughts?

Best,
Jack Ye
