Hi: I have addressed most of the comments in the document. What is the next step? Should we hold a vote on whether to reject this spec or go ahead with it?

On Sat, Sep 30, 2023 at 11:20 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
>
> Sorry for the late reply, I have been busy recently. I've updated the
> design with more details about your questions, and here is a summary:
>
> 1. Would there be only one delete vector per data file?
>
> Yes. It's possible to have multiple deletion vectors per very large
> data file to further reduce write amplification, but I'm not sure if
> that is over-design.
>
> 2. Would this require merge of existing vectors and new deletes at write
> time?
>
> Yes. Merging two bitmaps would be quite efficient.
>
> 3. How would the data file for a vector be identified?
>
> It will be stored in the manifest file. We will have one entry per
> deletion file, and we add an extra field `data_file_path` for the
> associated data file path. See Changes to spec
> <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.p4vrosjzl14j>
> for details, and Write process
> <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.tft7a34rd2be>
> for an example.
>
> 4. If multiple vectors are allowed, what is the plan for keeping the
> number of delete vectors small?
>
> I see multiple vectors per data file as an optimization for very large
> data files, and I'm not sure if it's over-design.
>
> 5. Would we allow writing multiple delete vectors into the same file?
>
> I don't want to do that. Merging delete vectors into one file raises two
> concerns:
>
> - Write amplification.
> - It makes concurrent modification of data files difficult.
>
> 6. How would we track which files are affected by a combined file of
> delete vectors?
>
> Sorry, I don't quite get your point.
>
> 7. What are the details of the proposed file format?
>
> I think roaring bitmap would be a good candidate, but other columnar
> formats such as Parquet and ORC are also possible since they provide great
> compression for boolean columns. I've mentioned it here
> <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.nrhcjanzai0v>
>
> On Thu, Sep 21, 2023 at 4:53 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Ryan,
>>
>> Thanks for the feedback. Unfortunately, I was not able to join the
>> Iceberg community sync meeting yesterday, I promise I will join the
>> next ones.
>>
>> I think the proposal is very interesting and also the
>> discussion/comments in the document. I agree that some points should
>> be discussed further. I propose to update the document with your
>> points/questions.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On Thu, Sep 21, 2023 at 2:02 AM Ryan Blue <b...@tabular.io> wrote:
>> >
>> > Renjie, thanks for the proposal.
>> >
>> > We talked about this today in the Iceberg community sync and the general
>> > feedback was that we're excited to work on this, but the proposal left a
>> > few areas unclear. There are a few decisions about how to manage the
>> > delete vectors that need to be added to the design. For example:
>> >
>> > 1. Would there be only one delete vector per data file?
>> > 2. Would this require merge of existing vectors and new deletes at
>> >    write time?
>> > 3. How would the data file for a vector be identified?
>> > 4. If multiple vectors are allowed, what is the plan for keeping the
>> >    number of delete vectors small?
>> > 5. Would we allow writing multiple delete vectors into the same file?
>> > 6. How would we track which files are affected by a combined file of
>> >    delete vectors?
>> > 7. What are the details of the proposed file format?
>> >
>> > In short, we just want to better understand how all this would work.
>> >
>> > Thanks!
>> >
>> > Ryan
>> >
>> > On Mon, Sep 18, 2023 at 8:22 PM Renjie Liu <liurenjie2...@gmail.com>
>> > wrote:
>> >>
>> >> Hi, all:
>> >>
>> >> I have a proposal to introduce a deletion vector file to reduce write
>> >> amplification of Iceberg tables:
>> >>
>> >> https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit?usp=sharing
>> >>
>> >> Comments are welcome, and I look forward to hearing your advice.
>> >
>> > --
>> > Ryan Blue
>> > Tabular
>
> --
> Renjie Liu
> Software Engineer, MVAD

--
Renjie Liu
Software Engineer, MVAD
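
For readers following answers 2 and 3 in the thread, here is a minimal sketch of merging an existing deletion vector with new deletes for the same data file at write time. It is an illustration under stated assumptions, not the format from the proposal: a plain Python integer stands in for a roaring bitmap, and `DeletionVector`, `from_positions`, and `merge` are hypothetical names. Only the `data_file_path` field name comes from the thread.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DeletionVector:
    """Hypothetical in-memory deletion vector covering one data file."""

    data_file_path: str  # field name taken from the thread
    bits: int = 0        # bit i set means row position i is deleted

    def is_deleted(self, pos: int) -> bool:
        return (self.bits >> pos) & 1 == 1

    def deleted_count(self) -> int:
        return bin(self.bits).count("1")


def from_positions(data_file_path: str, positions: Iterable[int]) -> DeletionVector:
    """Build a vector from row positions (assumed to be non-negative)."""
    bits = 0
    for pos in positions:
        if pos < 0:
            raise ValueError("row positions must be non-negative")
        bits |= 1 << pos
    return DeletionVector(data_file_path, bits)


def merge(existing: DeletionVector, new: DeletionVector) -> DeletionVector:
    """Union two vectors for the same data file with a single bitwise OR."""
    if existing.data_file_path != new.data_file_path:
        raise ValueError("vectors refer to different data files")
    return DeletionVector(existing.data_file_path, existing.bits | new.bits)
```

With one vector per data file, a writer would read the current vector, call `merge` with the newly deleted positions, and write the result as the replacement deletion file. A real implementation would presumably swap the integer for a compressed bitmap, but the union step has the same shape.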