Hi:
I have addressed most of the comments in the document. What is the next
step? Should we hold a vote on this spec to decide whether to reject it or
move forward with it?

On Sat, Sep 30, 2023 at 11:20 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
> Sorry for the late reply; I have been busy recently. I've updated the
> design with more details addressing your questions, and here is a summary:
>
> > 1. Would there be only one delete vector per data file?
> Yes. We could allow multiple deletion vectors for a very large data file
> to further reduce write amplification, but I'm not sure whether that would
> be over-design.
>
> > 2. Would this require merge of existing vectors and new deletes at
> > write time?
> Yes. Merging two bitmaps would be quite efficient.
>
> > 3. How would the data file for a vector be identified?
> It will be stored in the manifest file. There will be one entry per
> deletion file, with an extra field `data_file_path` holding the path of
> the associated data file. See Changes to spec
> <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.p4vrosjzl14j>
> for details, and Write process
> <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.tft7a34rd2be>
> for an example.
>
> > 4. If multiple vectors are allowed, what is the plan for keeping the
> > number of delete vectors small?
> I see multiple vectors per data file as an optimization for very large
> data files, and I'm not sure whether it would be over-design.
>
> > 5. Would we allow writing multiple delete vectors into the same file?
> I would prefer not to. Merging delete vectors into one file raises two
> concerns:
>
>    - Write amplification.
>    - It makes concurrent modification of data files difficult.
>
> > 6. How would we track which files are affected by a combined file of
> > delete vectors?
> Sorry, I don't quite get your point.
>
> > 7. What are the details of the proposed file format?
> I think a roaring bitmap would be a good candidate, but columnar formats
> such as Parquet or ORC are also possible, since they provide good
> compression for boolean columns. I've mentioned this here
> <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.nrhcjanzai0v>
>
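
To make answers 2 and 3 above concrete, here is a minimal sketch in Python.
It is not the proposed implementation. It assumes a plain Python integer
stands in for the bitmap (the design names a roaring bitmap as one candidate
format), and the class name `DeletionVectorEntry`, the helper functions, and
the example path are all hypothetical. Only the `data_file_path` field name
comes from the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletionVectorEntry:
    """Hypothetical manifest entry: one deletion vector tied to one data file."""
    data_file_path: str  # field name taken from the thread
    bitmap: int          # assumption: bit i set means row position i is deleted


def to_bitmap(positions):
    """Build a bitmap from an iterable of deleted row positions."""
    bitmap = 0
    for pos in positions:
        bitmap |= 1 << pos
    return bitmap


def merge_deletes(existing, new_positions):
    """Return a new entry whose bitmap is the union of old and new deletes."""
    return DeletionVectorEntry(
        data_file_path=existing.data_file_path,
        bitmap=existing.bitmap | to_bitmap(new_positions),
    )


def deleted_positions(entry):
    """List the deleted row positions in ascending order."""
    bits = entry.bitmap
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical path, used only for illustration.
old = DeletionVectorEntry("s3://bucket/table/data/file-a.parquet", to_bitmap([1, 4]))
merged = merge_deletes(old, [4, 7])
print(deleted_positions(merged))  # [1, 4, 7]
```

Because the merge is a union, deleting a position that is already deleted
leaves the vector unchanged.
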
> On Thu, Sep 21, 2023 at 4:53 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Ryan,
>>
>> Thanks for the feedback. Unfortunately, I was not able to join the
>> Iceberg community sync meeting yesterday; I promise I will join the
>> next ones.
>>
>> I think the proposal is very interesting, as are the discussion and
>> comments in the document. I agree that some points should be discussed
>> further. I propose to update the document with your points/questions.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On Thu, Sep 21, 2023 at 2:02 AM Ryan Blue <b...@tabular.io> wrote:
>> >
>> > Renjie, thanks for the proposal.
>> >
>> > We talked about this today in the Iceberg community sync and the
>> > general feedback was that we're excited to work on this, but the
>> > proposal left a few areas unclear. There are a few decisions about how
>> > to manage the delete vectors that need to be added to the design. For
>> > example:
>> > 1. Would there be only one delete vector per data file?
>> > 2. Would this require merge of existing vectors and new deletes at
>> > write time?
>> > 3. How would the data file for a vector be identified?
>> > 4. If multiple vectors are allowed, what is the plan for keeping the
>> > number of delete vectors small?
>> > 5. Would we allow writing multiple delete vectors into the same file?
>> > 6. How would we track which files are affected by a combined file of
>> > delete vectors?
>> > 7. What are the details of the proposed file format?
>> >
>> > In short, we just want to better understand how all this would work.
>> >
>> > Thanks!
>> >
>> > Ryan
>> >
>> >
>> > On Mon, Sep 18, 2023 at 8:22 PM Renjie Liu <liurenjie2...@gmail.com>
>> > wrote:
>> >>
>> >> Hi, all:
>> >>
>> >>
>> >>
>> >> I have a proposal to introduce a deletion vector file to reduce the
>> >> write amplification of Iceberg tables:
>> >>
>> >>
>> >> https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit?usp=sharing
>> >>
>> >>
>> >>
>> >> Comments are welcome, and I look forward to hearing your advice.
>> >
>> >
>> >
>> > --
>> > Ryan Blue
>> > Tabular
>>
>
>
> --
> Renjie Liu
> Software Engineer, MVAD
>


-- 
Renjie Liu
Software Engineer, MVAD
