From an engine perspective, I think compatibility between Delta and Iceberg on DVs is a great thing to have. The additions for cross-compat seem like a minor thing to me that is vastly outweighed by a future where Delta tables with DVs are supported in Delta Uniform and can be read by any Iceberg V3-compliant engine.
-Scott

On Tue, Oct 15, 2024 at 2:06 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> Are there engines/vendors/companies in the community that support both Iceberg and Delta and would benefit from having one blob layout for DVs?
>
> - Anton
>
> On Tue, Oct 15, 2024 at 11:10 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
>> Thanks, Szehon.
>>
>> To clarify on compatibility: using the same format for the blobs makes it so that existing Delta readers can read and use the DVs written by Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra fields they would not need to change how their readers work. That's why I think there is a benefit to using the same format: we avoid unnecessary fragmentation and make sure data and delete files are compatible.
>>
>> Ryan
>>
>> On Tue, Oct 15, 2024 at 10:57 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>
>>> This is awesome work by Anton and Ryan. It looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient; it's been a long time coming, and I'm truly excited to see the great improvement in storage/perf.
>>>
>>> With respect to these fields, I think most of the concerns are already mentioned by the other community members in the PRs https://github.com/apache/iceberg/pull/11238 and https://github.com/apache/iceberg/pull/11240, so there's not much to add. The DV itself is the RoaringBitmap 64-bit format, so that's great; the argument for the CRC seems reasonable; and I don't have enough data to be opinionated about the endianness/magic bytes.
>>>
>>> But I do lean toward the many PR comments that the extra length field is unnecessary and would just add confusion. It seems to me that the Iceberg community has made so much effort to trim the spec to the bare minimum for cleanliness and efficiency, so I do feel the field is not in the normal direction of the project.
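[To make the "RoaringBitmap 64-bit format" mentioned above concrete, here is a toy Python model, an assumption-laden sketch, not the actual RoaringBitmap implementation or its serialized layout: a 64-bit position is split into a high 32-bit key and a low 32-bit value, and each key selects a 32-bit bitmap holding the low parts. The class name and dict-of-sets representation are purely illustrative; real RoaringBitmap uses compressed containers.]

```python
class Simple64BitBitmap:
    """Toy model of a 64-bit roaring-style bitmap (illustration only)."""

    def __init__(self):
        # high 32 bits of a position -> set of low 32-bit values
        # (a real RoaringBitmap stores compressed 32-bit containers here)
        self.buckets = {}

    def add(self, pos):
        high, low = pos >> 32, pos & 0xFFFFFFFF
        self.buckets.setdefault(high, set()).add(low)

    def contains(self, pos):
        high, low = pos >> 32, pos & 0xFFFFFFFF
        return low in self.buckets.get(high, set())


dv = Simple64BitBitmap()
dv.add(7)
dv.add((1 << 32) + 7)  # same low 32 bits, but a different high-key bucket
assert dv.contains(7) and dv.contains((1 << 32) + 7)
assert not dv.contains(8)
```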
>>> Also, I'm not clear on the plan for old Delta readers: they can't read Puffin anyway, so if Delta adopts Puffin, only new readers could adopt it? Anyway, great work again, and thanks for raising the issue on the dev list!
>>>
>>> Thanks,
>>> Szehon
>>>
>>> On Mon, Oct 14, 2024 at 5:14 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>
>>>> > I think it might be worth mentioning the current proposal makes some, mostly minor, design choices to try to be compatible with Delta Lake deletion vectors.
>>>>
>>>> Yes it does, and thanks for pointing this out, Micah. I think it's important to consider whether compatibility is important to this community. I just replied to Piotr on the PR, but I'll adapt some of that response here to reach the broader community.
>>>>
>>>> I think there is value in supporting compatibility with older Delta readers, but I acknowledge that this may be my perspective because my employer has a lot of Delta customers that we are going to support now and in the future.
>>>>
>>>> The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code in a migration. We see a similar issue in Hive-to-Iceberg migrations, where unknown older readers prevent migration entirely because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Compatibility with the format those readers expect makes it possible to maintain Delta metadata for them temporarily. That increases confidence that things won't randomly break and makes it easier to get people to move forward.
>>>>
>>>> The second reason for maintaining compatibility is that we want the formats to become more similar.
My hope is that we can work across both communities and come up with a common metadata format in a future version -- which explains my interest in smooth migrations. Maintaining compatibility in cases like this builds trust and keeps our options open: if we have compatible data layers, then it's easier to build a compatible metadata layer. I'm hoping that if we make the blob format compatible, we can get the Delta community to start using Puffin for better self-describing delete files.
>>>>
>>>> Other people may not share those goals, so I think it helps to consider what is being compromised for this compatibility. I don't think it is too much. There are 2 additional fields:
>>>> * A 4-byte length field (that Iceberg doesn't need)
>>>> * A 4-byte CRC to validate the contents of the bitmap
>>>>
>>>> There are also changes to how these would have been added if the Iceberg community were building this independently:
>>>> * Our initial version didn't include a CRC at all; now that we think it's useful, compatibility means using a CRC-32 checksum rather than a newer one
>>>> * The Delta format uses big endian for its fields (or mixed endian if you consider that RoaringBitmap is LE)
>>>> * The magic bytes (added to avoid reading the Puffin footer) would have been different
>>>>
>>>> Overall, I don't think that those changes to what we would have done are unreasonable. It's only 8 extra bytes, and half of them are for a checksum that is a good idea.
>>>>
>>>> I'm looking forward to what the rest of the community thinks about this. Thanks for reviewing the PR!
>>>>
>>>> Ryan
>>>>
>>>> On Sun, Oct 13, 2024 at 10:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Thanks for the PRs! I reviewed Anton's document; I will do a pass on the PRs.
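[The extra fields discussed in the thread (a big-endian length prefix, magic bytes, the bitmap payload, and a trailing CRC-32) can be sketched in a few lines of Python. This is only a hypothetical illustration of how such a blob could be framed; the actual byte layout, field order, and magic value for delete-vector-v1 are defined in PR #11238, not here, and the magic constant below is made up.]

```python
import struct
import zlib

MAGIC = b"\xd1\xd3\x39\x64"  # placeholder magic bytes; the real value is in the spec PR


def write_dv_blob(bitmap_bytes: bytes) -> bytes:
    """Frame a serialized bitmap with length, magic, and CRC-32 (sketch)."""
    payload = MAGIC + bitmap_bytes
    length = struct.pack(">I", len(payload))         # 4-byte big-endian length (Iceberg itself doesn't need it)
    crc = struct.pack(">I", zlib.crc32(payload))     # 4-byte CRC-32 over the payload
    return length + payload + crc


def read_dv_blob(blob: bytes) -> bytes:
    """Validate the frame and return the serialized bitmap (sketch)."""
    (length,) = struct.unpack(">I", blob[:4])
    payload = blob[4:4 + length]
    (crc,) = struct.unpack(">I", blob[4 + length:8 + length])
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupt deletion vector: CRC mismatch")
    if payload[:4] != MAGIC:
        raise ValueError("not a deletion vector blob")
    return payload[4:]  # the serialized RoaringBitmap
```

The sketch shows why the overhead is only 8 bytes per blob: 4 for the length and 4 for the checksum, with the magic counted as part of the payload the CRC protects.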
>>>>>
>>>>> Imho, it's important to get feedback from query engines because, while delete vectors are not a problem per se (they are what we already use as an internal representation), the use of Puffin files to store them is "impactful" for the query engines (some query engines will probably need to implement the Puffin spec (read/write) in a language other than Java, for instance Apache Impala).
>>>>>
>>>>> I like the proposal, I just hope we won't "surprise" some query engines with extra work :)
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Thu, Oct 10, 2024 at 11:41 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>> >
>>>>> > Hi everyone,
>>>>> >
>>>>> > There seems to be broad agreement around Anton's proposal to use deletion vectors in Iceberg v3, so I've opened two PRs that update the spec with the proposed changes. The first, PR #11238, adds a new Puffin blob type, delete-vector-v1, that stores a delete vector. The second, PR #11240, updates the Iceberg table spec.
>>>>> >
>>>>> > Please take a look and comment!
>>>>> >
>>>>> > Ryan
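[To give a feel for the Puffin read/write work JB raises for non-Java engines, here is a deliberately simplified, self-consistent sketch of a Puffin-like container: a leading magic, concatenated blobs, and a trailing footer carrying JSON metadata plus its length. The helper names, flag handling, and exact byte layout are assumptions for illustration; the authoritative layout (including footer compression and flag semantics) is the Puffin spec itself.]

```python
import json
import struct

MAGIC = b"PFA1"  # Puffin files begin and end with a 4-byte magic


def write_puffin_like(blobs: list[bytes], footer_meta: dict) -> bytes:
    """Simplified writer: magic, blobs, then footer (JSON + LE length + flags + magic)."""
    body = MAGIC + b"".join(blobs)
    payload = json.dumps(footer_meta).encode("utf-8")
    footer = (
        MAGIC
        + payload
        + struct.pack("<i", len(payload))  # little-endian payload size
        + b"\x00" * 4                      # flag bytes (uncompressed footer assumed)
        + MAGIC
    )
    return body + footer


def read_footer(data: bytes) -> dict:
    """Simplified reader: walk backward from the trailing magic to the JSON metadata."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Puffin-like file")
    (size,) = struct.unpack("<i", data[-12:-8])
    payload = data[-12 - size:-12]
    return json.loads(payload.decode("utf-8"))
```

Even this toy version shows the shape of the work: an engine in C++ or another language only needs byte framing, a JSON parser, and (in the real spec) optional footer decompression to locate blobs.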