Hi:
Sorry for the late reply; I have been busy recently. I've updated the
design doc with more detail on your questions, and here is a summary:

> 1. Would there be only one delete vector per data file?
Yes. We could allow multiple deletion vectors for a very large data file
to further reduce write amplification, but I'm not sure whether that would
be over-engineering.

> 2. Would this require merge of existing vectors and new deletes at write
> time?
Yes. Merging two bitmaps would be quite efficient.
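
As a rough illustration of why the merge is cheap (a minimal sketch, not
taken from the design doc): if a deletion vector is a bitmap of deleted row
positions, merging is a bitwise OR. Here the bitmap is modeled as a plain
Python int, and all function names are hypothetical.

```python
def positions_to_bitmap(positions):
    """Build a bitmap (modeled as an int) with one bit set per deleted row position."""
    bitmap = 0
    for pos in positions:
        bitmap |= 1 << pos
    return bitmap


def merge_deletion_vectors(existing, new):
    """Merge two deletion bitmaps; a row is deleted if either bitmap marks it."""
    return existing | new


def deleted_positions(bitmap):
    """List the row positions whose bit is set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


existing = positions_to_bitmap([1, 5, 9])   # deletes already on disk
new = positions_to_bitmap([5, 12])          # deletes from the current write
merged = merge_deletion_vectors(existing, new)
print(deleted_positions(merged))
```

A real implementation would use a compressed bitmap rather than an int, but
the merge would still be a union of the two bitmaps.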

> 3. How would the data file for a vector be identified?
It will be recorded in the manifest file. Each deletion file gets one
manifest entry, with an extra field `data_file_path` that holds the path of
the associated data file. See "Changes to spec"
<https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.p4vrosjzl14j>
for details, and "Write process"
<https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.tft7a34rd2be>
for an example.
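
To make the lookup concrete, here is a minimal sketch assuming one deletion
vector per data file. Only `data_file_path` comes from the proposal; the
class name, the `file_path` field, the example paths, and the helper
function are hypothetical and are not the actual spec or any Iceberg API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletionFileEntry:
    file_path: str       # hypothetical: location of the deletion vector file
    data_file_path: str  # field from the proposal: the data file it applies to


def index_by_data_file(entries):
    """Map each data file path to its deletion file entry.

    Assumes at most one deletion vector per data file and raises otherwise.
    """
    index = {}
    for entry in entries:
        if entry.data_file_path in index:
            raise ValueError(f"multiple deletion vectors for {entry.data_file_path}")
        index[entry.data_file_path] = entry
    return index


entries = [
    DeletionFileEntry("deletes/a.dv", "data/a.parquet"),
    DeletionFileEntry("deletes/b.dv", "data/b.parquet"),
]
index = index_by_data_file(entries)
print(index["data/a.parquet"].file_path)
```

With such an index, a reader planning a scan of a data file can find its
deletion vector with a single lookup.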

> 4. If multiple vectors are allowed, what is the plan for keeping the
> number of delete vectors small?
I see multiple vectors per data file as an optimization for very large data
files, and I'm not sure whether it would be over-engineering.

> 5. Would we allow writing multiple delete vectors into the same file?
I would rather not do that. Merging delete vectors into one file raises two
concerns:

   - Write amplification.
   - It makes concurrent modification of data files difficult.

> 6. How would we track which files are affected by a combined file of
> delete vectors?
Sorry, I don't quite get your point.

> 7. What are the details of the proposed file format?
I think roaring bitmap would be a good candidate, but columnar formats such
as Parquet and ORC are also possible, since they provide good compression
for boolean columns. I've mentioned it here:
<https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.nrhcjanzai0v>
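
For readers unfamiliar with roaring bitmaps, the sketch below shows only the
basic layout idea, under simplifying assumptions: 32-bit positions are
grouped by their upper 16 bits, and each group stores the lower 16 bits in
either a sorted array (sparse groups) or a bitmap (dense groups). It omits
run containers and serialization, the function names are hypothetical, and
it is not the format proposed in the doc.

```python
from collections import defaultdict

ARRAY_LIMIT = 4096  # roaring switches from array to bitmap container above this size


def chunk_positions(positions):
    """Group 32-bit row positions by their upper 16 bits (the chunk key)."""
    chunks = defaultdict(list)
    for pos in sorted(set(positions)):
        chunks[pos >> 16].append(pos & 0xFFFF)
    return dict(chunks)


def container_kind(low_bits):
    """Pick the container type a roaring-style layout would use for one chunk."""
    return "array" if len(low_bits) <= ARRAY_LIMIT else "bitmap"


chunks = chunk_positions([3, 65536, 65540, 3])
for key, low_bits in chunks.items():
    print(key, container_kind(low_bits), low_bits)
```

The point is that sparse deletes cost a couple of bytes per position, while
dense deletes are capped at a fixed-size bitmap per chunk.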

On Thu, Sep 21, 2023 at 4:53 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Ryan,
>
> Thanks for the feedback. Unfortunately, I was not able to join the
> Iceberg community sync meeting yesterday, I promise I will join the
> next ones.
>
> I think the proposal is very interesting and also the
> discussion/comments in the document. I agree that some points should
> be discussed further. I propose to update the document with your
> points/questions.
>
> Thanks !
>
> Regards
> JB
>
> On Thu, Sep 21, 2023 at 2:02 AM Ryan Blue <b...@tabular.io> wrote:
> >
> > Renjie, thanks for the proposal.
> >
> > We talked about this today in the Iceberg community sync and the general
> feedback was that we're excited to work on this, but the proposal left a few
> areas unclear. There are a few decisions about how to manage the delete
> vectors that need to be added to the design. For example:
> > 1. Would there be only one delete vector per data file?
> > 2. Would this require merge of existing vectors and new deletes at write
> time?
> > 3. How would the data file for a vector be identified?
> > 4. If multiple vectors are allowed, what is the plan for keeping the
> number of delete vectors small?
> > 5. Would we allow writing multiple delete vectors into the same file?
> > 6. How would we track which files are affected by a combined file of
> delete vectors?
> > 7. What are the details of the proposed file format?
> >
> > In short, we just want to better understand how all this would work.
> >
> > Thanks!
> >
> > Ryan
> >
> >
> > On Mon, Sep 18, 2023 at 8:22 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
> >>
> >> Hi, all:
> >>
> >>
> >>
> >> I have a proposal to introduce a deletion vector file to reduce the write
> amplification of Iceberg tables:
> >>
> >>
> https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit?usp=sharing
> >>
> >>
> >>
> >> Comments are welcome, and I look forward to hearing your advice.
> >
> >
> >
> > --
> > Ryan Blue
> > Tabular
>


-- 
Renjie Liu
Software Engineer, MVAD
