Thanks for putting the spec PRs together, Ryan! A bit of context below.

The concept of DVs is not external to Iceberg. We have been using Roaring bitmaps (aka DVs) as an in-memory representation for position deletes, which allowed us to support vectorized reads and buffer out-of-order positions in writers. The design doc [1] that I shared earlier argues that persisting these DVs on disk is a better choice than translating to and from Avro/ORC/Parquet files.

I reached out to many folks in the community who expressed interest in position deletes to make sure we are on the same page. I think it would be fair to say we reached consensus on introducing Puffin delete files in V3.

The reason this thread is not a vote but rather an ask for feedback is that the spec PRs include two extra fields in the blob for compatibility with Delta (blob data length and CRC). If we were to do it from scratch, I’d probably skip the length and use consistent byte order throughout the blob. Are these differences major enough to invent a slightly different format? I personally don’t think so. That said, it is a question for the Iceberg community whether we are OK making these compromises.

I created a PR [2] that implements the proposed spec as an example. Take a look and let me know.

Regardless of whether we follow the Delta blob layout, I’d still propose to keep the magic number and CRC for the reasons I explained here [3].

- Anton

[1] - https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/
[2] - https://github.com/apache/iceberg/pull/11302
[3] - https://github.com/apache/iceberg/pull/11238#discussion_r1798622067

On Sat, Oct 12, 2024 at 15:43, Yufei Gu <flyrain...@gmail.com> wrote:

> I’d like to offer a perspective on compatibility. If the design is robust
> and reasonable, it is certainly welcomed. However, if the design falls
> short, it becomes a compromise, not just for Iceberg users, but for the
> entire ecosystem.
>
> I look forward to hearing your thoughts on this.
>
> Yufei
>
>
> On Fri, Oct 11, 2024 at 9:06 PM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Hi Ryan,
>>
>> Do you mean the doc Improve Position Deletes in V3
>> <https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?tab=t.0>
>> by Anton? I don't recall Anton used the term "deletion vector" in his
>> proposal.
>>
>> On Sat, Oct 12, 2024 at 12:30 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> I think it might be worth mentioning the current proposal makes some,
>>> mostly minor, design choices to try to be compatible with Delta Lake
>>> deletion vectors. I think there might be a general philosophical question
>>> on what compromises the community is willing to make for compatibility
>>> reasons.
>>>
>>> On Thu, Oct 10, 2024 at 2:42 PM rdb...@gmail.com <rdb...@gmail.com>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> There seems to be broad agreement around Anton's proposal to use
>>>> deletion vectors in Iceberg v3, so I've opened two PRs that update the spec
>>>> with the proposed changes. The first, PR #11238
>>>> <https://github.com/apache/iceberg/pull/11238/files>, adds a new
>>>> Puffin blob type, delete-vector-v1, that stores a delete vector. The
>>>> second, PR #11240 <https://github.com/apache/iceberg/pull/11240/files>,
>>>> updates the Iceberg table spec.
>>>>
>>>> Please take a look and comment!
>>>>
>>>> Ryan
>>>>
>>>