For the conversion from Delta to Iceberg, wouldn't we need to scan all of the Delta vectors if we chose a different CRC or other endianness? Does Delta mandate that writers also include this information in their metadata files?
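For concreteness, here is a minimal sketch (in Java) of the framing that, per the discussion below, existing Delta readers expect around a deletion vector: a 4-byte big-endian length, magic bytes, the serialized bitmap, and a big-endian CRC-32. The MAGIC constant is a placeholder rather than the actual spec value, and the exact byte ranges covered by the length and checksum are assumptions here; the authoritative layout is in the spec PRs.

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.CRC32;

    // Sketch only: the blob framing an existing Delta-style reader expects.
    class DvBlobSketch {
      private static final int MAGIC = 0x00000000; // placeholder, not the spec value

      static byte[] readBitmap(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();           // 4 bytes, big-endian; Iceberg itself doesn't need this
        if (data.readInt() != MAGIC) {         // magic bytes identify the blob without the Puffin footer
          throw new IOException("not a deletion vector blob");
        }
        byte[] bitmap = new byte[length - 4];  // assumes the length covers magic + bitmap
        data.readFully(bitmap);
        long storedCrc = Integer.toUnsignedLong(data.readInt()); // 4 bytes, big-endian

        CRC32 crc = new CRC32();
        crc.update(bitmap);                    // per this thread, the CRC validates the bitmap contents
        if (crc.getValue() != storedCrc) {     // a different checksum or endianness choice
          throw new IOException("CRC mismatch"); // would make every existing reader fail here
        }
        return bitmap;                         // portable 64-bit RoaringBitmap bytes (little-endian)
      }
    }

If Iceberg writes this same framing, the check above is exactly what keeps old Delta readers working, and a Delta-to-Iceberg conversion can stay metadata-only.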
On Thu, Oct 17, 2024 at 4:26 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote: > We would want to have magic bytes + checksum as part of the blob in > Iceberg, as discussed in the spec PRs. If we chose something other than CRC > and/or used little endian for all parts of the blob, this would break the > compatibility in either direction and would prevent the use case that Scott > was mentioning. > > - Anton > > Thu, 17 Oct 2024 at 08:58 Bart Samwel <b...@databricks.com.invalid> > wrote: > >> I hope it's OK if I chime in. I'm one of the people responsible for the >> format for position deletes that is used in Delta Lake and I've been >> reading along with the discussion. Given that the main sticking point is >> whether this compatibility is worth the associated "not pure" spec, I >> figured that maybe I could mention what the consequences would be for the >> Delta Lake developers and users, depending on the outcome of this >> discussion. I can also give some historical background, in case people find >> that interesting. >> >> (1) Historical background on why the Delta Lake format is the way it is. >> >> The reason this length field was added on the Delta Lake side is >> that we didn't have a framing format like Puffin. Like you, we wanted >> the Deletion Vector files to be parseable by themselves, if only for >> debugging purposes. If we could go back, then we might have adopted Puffin. >> Or we would have made the pointers in the metadata point at only the blob + >> CRC, and kept the length outside of it, in the framing format. But the >> reality is that right now there are many clients out there that read the >> current format, and we can't change this anymore. :( The endianness >> difference is simply an unfortunate historical accident. They are at >> different layers, and this was the first time we really did anything >> binary-ish in Delta Lake, so we didn't actually have any consistent >> baseline to be consistent with. We only noticed the difference once it had >> "escaped" into the wild, and then it was too late. >> >> Am I super happy with it? No. Is it *terrible*? Well, not terrible >> enough for us to go back and upgrade the protocol to fix it. It doesn't >> lead to broken behavior. This is just a historical idiosyncrasy, and the >> friction caused by protocol changes is much higher than any benefit from a >> cleaner spec. So basically, we're stuck with it until the next time we do a >> major overhaul of the protocol. >> >> (2) What are the consequences for Delta Lake if this is *not* made >> compatible? >> >> Well, then we'd have to support this new layout in Delta Lake. This would >> be a long and relatively painful process. >> >> It would not just be a matter of "retconning" it into the protocol and >> updating the libraries. There are simply too many connectors out there, >> owned by different vendors, etc. Until they adopted the change, they >> would simply error out on these files *at runtime with weird errors*, or >> potentially even use the invalid values and crash and burn. (Lack of proper >> input validation is unfortunately a real thing in the wild.) >> >> So instead, what we would do is add this in a new protocol version of >> Delta Lake. Or actually, it would be a "table feature", since Delta Lake >> has à la carte protocol features. But these features tend to take a long >> time to fully permeate the connector ecosystem, and people don't actually >> upgrade their systems very quickly. 
That means that realistically, nobody >> would be able to make use of this for quite a while. >> >> So what would need to happen instead? For now we would have to rewrite >> the delete files on conversion, only to add this annoying little length >> field. This would add at least 200 ms of latency to any metadata >> conversion, if only because of the cloud object storage GET and PUT >> latency. Furthermore, the conversion latency for a single commit would >> become dependent on the number of delete files instead of being O(1). And >> it would take significant development time to actually make this work and >> to make this scale. >> >> Based on these consequences, you can imagine why I would *really* >> appreciate it if the community could weigh this aspect as part of their >> deliberations. >> >> (3) Is Iceberg -> Delta Lake compatibility actually important enough to >> care about? >> >> From where I'm standing, compatibility is nearly always very important. >> It's not important for users who have standardized fully on Iceberg, and >> those are probably the most represented here in the dev community. But in >> the world that I'm seeing, companies are generally using a mixture of many >> different systems, and they are suffering because of the inability of >> systems to operate efficiently on each other's data. Being able to convert >> easily and efficiently in both directions benefits users. In this case it's >> about Iceberg and Delta Lake, but IMO this is true as a principle >> regardless of which systems you're talking about -- lower friction for >> interoperability is extremely valuable because it increases users' choice in >> the tools that they can use -- it allows them to choose the right tool for >> the job at hand. And it doesn't matter if users are converting from Delta >> Lake to Iceberg or the other way around; they are in fact all Iceberg users! >> >> Putting it simply: I have heard many users complain that they can't >> (efficiently) read data from system X in system Y. At the same time, I have >> never heard a user complaining about having inconsistent endianness in >> their protocols. >> >> On Thu, Oct 17, 2024 at 11:02 AM Jean-Baptiste Onofré <j...@nanthrax.net> >> wrote: >> >>> Hi folks, >>> >>> As Daniel said, I think we actually have two proposals in one: >>> 1. The first proposal is "improvement of positional delete files", >>> using delete vectors stored in Puffin files. I like this proposal; it >>> makes a lot of sense. I think we have a kind of consensus here (we >>> discussed how to parse Puffin files, etc. -- a good discussion). >>> 2. The second, based on (1), is supporting a vector format "compatible" with >>> Delta. This is also interesting. However, do we really need this in >>> Spec V3? Why not focus on the original proposal (improvement of >>> positional deletes) with a simple approach, and evaluate Delta >>> compatibility later? If the compatibility is "easy", I'm not against >>> including it in V3, but users might be disappointed if bringing this in >>> means a tradeoff. >>> >>> IMHO, I would focus on (1) because it would be a great feature for the >>> Iceberg community. >>> >>> Regards >>> JB >>> >>> On Wed, Oct 16, 2024 at 9:16 PM Daniel Weeks <dwe...@apache.org> wrote: >>> > >>> > Hey Everyone, >>> > >>> > I feel like at this point we've articulated all of the various options >>> and paths forward, but this really just comes down to a matter of whether >>> we want to make a concession here for the purpose of compatibility. 
>>> > >>> > If we were building this with no prior art, I would expect to omit the >>> length and align the endianness, but given there's an opportunity to close >>> the gap with minor inefficiency, it merits real consideration. >>> > >>> > This proposal takes into consideration bi-directional compatibility >>> while maintaining backward compatibility. Do we feel this is beneficial to >>> the larger community, or should we abandon the efforts for compatibility? >>> > >>> > -Dan >>> > >>> > On Wed, Oct 16, 2024 at 11:01 AM rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >> >>> >> Thanks, Russell, for the clear summary of the pros and cons! I agree >>> there's some risk to Iceberg implementations, but I think that is mitigated >>> somewhat by code reuse. For example, an engine like Trino could simply >>> reuse code for reading Delta bitmaps, so we would get some validation and >>> support more easily. >>> >> >>> >> Micah, I agree with the requirements that you listed, but I would say >>> #2 is not yet a "requirement" for the design. It's a consideration that I >>> think has real value, but it's up to the community whether we want to add >>> some cost to #1 to make #2 happen. I definitely think that #3 is a >>> requirement so that we can convert Delta to Iceberg metadata (as in the >>> iceberg-delta-lake module). >>> >> >>> >> For the set of options, I would collapse a few of them >>> because I think that we would use the same bitmap representation, the >>> portable 64-bit roaring bitmap. >>> >> >>> >> If that's the case (and probably even if we had some other >>> representation), then Delta can always add support for reading Iceberg >>> delete vectors. That means we either go with the current proposal (a), which >>> preserves the ability of existing Delta clients to read, or we go with a >>> different proposal that we think is better, in which case Delta adds >>> support. >>> >> >>> >> I think both options (c) and (d) have the same effect: Delta readers >>> need to change and that breaks forward compatibility. Specifically: >>> >> * I think that Option (c) would mean that we set the offset to either >>> the magic bytes or directly to the start of the roaring bitmap, so I think we >>> will almost certainly be able to read Delta DVs. Even if we didn't have a >>> similar bitmap encoding, we would probably end up adding support for >>> reading Delta DVs for iceberg-delta-lake. Then it's a question of whether >>> support for converted files is required -- similar to how we handle missing >>> partition values in data files from Hive tables, which we just updated the >>> spec to clarify. >>> >> * Option (d) is still incompatible with existing Delta readers, so >>> there isn't much of a difference between this and (b) >>> >> >>> >> To me, Micah's requirement #2 is a good goal, but needs to be >>> balanced against the cost. I don't see that cost as too high, and I think >>> avoiding fragmentation across the projects helps us work together more in >>> the future. But again, that may be my goal and not a priority for the >>> broader Iceberg community. >>> >> >>> >> Ryan >>> >> >>> >> On Wed, Oct 16, 2024 at 10:10 AM Micah Kornfield < >>> emkornfi...@gmail.com> wrote: >>> >>> >>> >>> One small point >>> >>>> >>> >>>> Theoretically we could end up with Iceberg implementers who have >>> bugs in this part of the code and we wouldn't even know it was an issue >>> till someone converted the table to Delta. 
>>> >>> >>> >>> >>> >>> I guess we could mandate readers validate all fields here to make >>> sure they are all consistent, even if unused. >>> >>> >>> >>> Separately, I think it might pay to take a step back and restate >>> the desired requirements of this design (in no particular order): >>> >>> 1. The best possible implementation of DVs (limited redundancy, no >>> extraneous fields, CPU efficiency, minimal space, etc.). >>> >>> 2. The ability for Delta Lake readers to read Iceberg DVs >>> >>> 3. The ability for Iceberg readers to read Delta Lake DVs >>> >>> >>> >>> The current proposal accomplishes 2 and 3 at very low cost, with some >>> cost for 1. I still think 1 is important. Table formats are still >>> going through a very large growth phase, so making suboptimal choices, when >>> there are better choices that don't add substantial cost, shouldn't be done >>> lightly. Granted, DVs are only going to be a very small part of the cost of >>> any table format. >>> >>> >>> >>> I think it is worth discussing other options to see if we think >>> there is a better one (if there isn't, then I would propose we continue with >>> the current proposal). Please chime in if I missed one, but off the top of >>> my head these are: >>> >>> >>> >>> a. Go forward with the current proposal >>> >>> b. Create a different DV format that we feel is better, and take >>> no additional steps for compatibility with Delta Lake. >>> >>> c. Create a different DV format that we feel is better, and allow >>> backwards compatibility by adding "reader" support for Delta Lake DVs in >>> the spec, but not "writer" support. >>> >>> d. Go forward with the current proposal but use offset and length >>> to trim off the "offset" bytes. (I assume this would break Delta Lake >>> readers, but I think Iceberg readers could still read Delta Lake tables. >>> This option is very close to (c) but doesn't address all concerns around the DV >>> format.) >>> >>> >>> >>> Out of these, my slight preference would be option (c) (add >>> migration capabilities from Delta Lake to Iceberg), followed by option (a) >>> (the current proposal). >>> >>> >>> >>> Cheers, >>> >>> Micah >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Tue, Oct 15, 2024 at 9:32 PM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> >>> >>>> @Scott We would have the ability to read Delta vectors regardless >>> of what we pick, since on the Iceberg side we really just need the bitmap and >>> the offset at which it is located within a file; everything else could be in the >>> Iceberg metadata. We don't have any disagreement on this aspect, I think. >>> >>>> >>> >>>> The question is whether we would write additional Delta-specific >>> metadata into the vector itself that an Iceberg implementation would not >>> use, so that current Delta readers could read Iceberg delete vectors without >>> a code change or rewriting the vectors. The underlying representation would >>> still be the same between the two formats. >>> >>>> >>> >>>> The pros of doing this are that a reverse translation of Iceberg to >>> Delta would be much simpler. Any implementers who already have Delta >>> vector read code can probably mostly reuse it, although our metadata would >>> let them skip to just reading the bitmap. >>> >>>> >>> >>>> The cons are that the metadata being written isn't used by Iceberg, >>> so any real tests would require using a Delta reader; anything else would >>> just be synthetic. 
Theoretically we could end up with Iceberg implementers >>> who have bugs in this part of the code and we wouldn't even know it was an >>> issue till someone converted the table to Delta. Other Iceberg readers >>> would just be ignoring these bytes, so we essentially are adding a >>> requirement and complexity (although not that much) to Iceberg writers for >>> the benefit of current Delta readers. Delta would probably also have to add >>> new fields to their metadata representations to capture the vector >>> metadata to handle our bitmaps. >>> >>>> >>> >>>> On Tue, Oct 15, 2024 at 5:56 PM Scott Cowell < >>> scott.cow...@dremio.com.invalid> wrote: >>> >>>>> >>> >>>>> From an engine perspective, I think compatibility between Delta and >>> Iceberg on DVs is a great thing to have. The additions for cross-compat >>> seem a minor thing to me, one that is vastly outweighed by a future where Delta >>> tables with DVs are supported in Delta UniForm and can be read by any >>> Iceberg V3-compliant engine. >>> >>>>> >>> >>>>> -Scott >>> >>>>> >>> >>>>> On Tue, Oct 15, 2024 at 2:06 PM Anton Okolnychyi < >>> aokolnyc...@gmail.com> wrote: >>> >>>>>> >>> >>>>>> Are there engines/vendors/companies in the community that support >>> both Iceberg and Delta and would benefit from having one blob layout for >>> DVs? >>> >>>>>> >>> >>>>>> - Anton >>> >>>>>> >>> >>>>>> Tue, 15 Oct 2024 at 11:10 rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >>>>>>> >>> >>>>>>> Thanks, Szehon. >>> >>>>>>> >>> >>>>>>> To clarify on compatibility, using the same format for the blobs >>> makes it so that existing Delta readers can read and use the DVs written by >>> Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra >>> fields, they would not need to change how readers work. That's why I think >>> there is a benefit to using the same format: we avoid unnecessary >>> fragmentation and make sure data and delete files are compatible. >>> >>>>>>> >>> >>>>>>> Ryan >>> >>>>>>> >>> >>>>>>> On Tue, Oct 15, 2024 at 10:57 AM Szehon Ho < >>> szehon.apa...@gmail.com> wrote: >>> >>>>>>>> >>> >>>>>>>> This is awesome work by Anton and Ryan. It looks like a ton of >>> effort has gone into the V3 position vector proposal to make it clean and >>> efficient -- a long time coming, and I'm truly excited to see the great >>> improvement in storage/perf. >>> >>>>>>>> >>> >>>>>>>> wrt these fields, I think most of the concerns are already >>> mentioned by the other community members in the PRs >>> https://github.com/apache/iceberg/pull/11238 and >>> https://github.com/apache/iceberg/pull/11240, so not much to add. The >>> DV itself is the 64-bit RoaringBitmap format, so that's great; the argument for >>> CRC seems reasonable, and I don't have enough data to be opinionated about >>> the endianness/magic bytes. >>> >>>>>>>> >>> >>>>>>>> But I do lean towards the many PR comments that the extra >>> length field is unnecessary and would just add confusion. It seemed to me >>> that the Iceberg community has made so much effort to trim the spec to the >>> bare minimum for cleanliness and efficiency, so I do feel the field is not >>> in the usual direction of the project. Also, I'm not clear on the plan for >>> old Delta readers -- they can't read Puffin anyway; if Delta adopts Puffin, >>> then new readers could adopt it? Anyway, great work again -- thanks for raising >>> the issue on the dev list! 
>>> >>>>>>>> >>> >>>>>>>> Thanks, >>> >>>>>>>> Szehon >>> >>>>>>>> >>> >>>>>>>> On Mon, Oct 14, 2024 at 5:14 PM rdb...@gmail.com < >>> rdb...@gmail.com> wrote: >>> >>>>>>>>> >>> >>>>>>>>> > I think it might be worth mentioning the current proposal >>> makes some, mostly minor, design choices to try to be compatible with Delta >>> Lake deletion vectors. >>> >>>>>>>>> >>> >>>>>>>>> Yes it does, and thanks for pointing this out, Micah. I think >>> it's important to consider whether compatibility is important to this >>> community. I just replied to Piotr on the PR, but I'll adapt some of that >>> response here to reach the broader community. >>> >>>>>>>>> >>> >>>>>>>>> I think there is value in supporting compatibility with older >>> Delta readers, but I acknowledge that this may be my perspective because my >>> employer has a lot of Delta customers that we are going to support now and >>> in the future. >>> >>>>>>>>> >>> >>>>>>>>> The main use case for maintaining compatibility with the Delta >>> format is that it's hard to move old jobs to new code in a migration. We >>> see a similar issue in Hive-to-Iceberg migrations, where unknown older >>> readers prevent migration entirely because they are hard to track down and >>> often read files directly from the backing object store. I'd like to avoid >>> the same problem here, where all readers need to be identified and migrated >>> at the same time. Compatibility with the format those readers expect makes >>> it possible to maintain Delta metadata for them temporarily. That increases >>> confidence that things won't randomly break and makes it easier to get >>> people to move forward. >>> >>>>>>>>> >>> >>>>>>>>> The second reason for maintaining compatibility is that we >>> want the formats to become more similar. My hope is that we can work >>> across both communities and come up with a common metadata format in a >>> future version -- which explains my interest in smooth migrations. >>> Maintaining compatibility in cases like this builds trust and keeps our >>> options open: if we have compatible data layers, then it's easier to build >>> a compatible metadata layer. I'm hoping that if we make the blob format >>> compatible, we can get the Delta community to start using Puffin for better >>> self-describing delete files. >>> >>>>>>>>> >>> >>>>>>>>> Other people may not share those goals, so I think it helps to >>> consider what is being compromised for this compatibility. I don't think it >>> is too much. There are two additional fields: >>> >>>>>>>>> * A 4-byte length field (that Iceberg doesn't need) >>> >>>>>>>>> * A 4-byte CRC to validate the contents of the bitmap >>> >>>>>>>>> >>> >>>>>>>>> There are also changes to how these would have been added if >>> the Iceberg community were building this independently. >>> >>>>>>>>> * Our initial version didn't include a CRC at all, but now >>> that we think it's useful, compatibility means using a CRC-32 checksum >>> rather than a newer one >>> >>>>>>>>> * The Delta format uses big endian for its fields (or mixed >>> endian if you consider that the RoaringBitmap is LE) >>> >>>>>>>>> * The magic bytes (added to avoid reading the Puffin footer) >>> would have been different >>> >>>>>>>>> >>> >>>>>>>>> Overall, I don't think that those changes to what we would >>> have done are unreasonable. It's only 8 extra bytes, and half of them are >>> for a checksum, which is a good idea. >>> >>>>>>>>> >>> >>>>>>>>> I'm looking forward to what the rest of the community thinks >>> about this. Thanks for reviewing the PR! 
>>> >>>>>>>>> >>> >>>>>>>>> Ryan >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> On Sun, Oct 13, 2024 at 10:45 PM Jean-Baptiste Onofré < >>> j...@nanthrax.net> wrote: >>> >>>>>>>>>> >>> >>>>>>>>>> Hi >>> >>>>>>>>>> >>> >>>>>>>>>> Thanks for the PRs! I reviewed Anton's document; I will do a >>> pass on the PRs. >>> >>>>>>>>>> >>> >>>>>>>>>> IMHO, it's important to get feedback from query engines: while >>> delete vectors are not a problem per se (they're what we are using as an internal >>> representation), the use of Puffin files to store them is "impactful" >>> for the query engines (some query engines might need to >>> implement the Puffin spec (read/write) in a language other than Java, for >>> instance Apache Impala). >>> >>>>>>>>>> >>> >>>>>>>>>> I like the proposal; I just hope we won't "surprise" some >>> query >>> >>>>>>>>>> engines with extra work :) >>> >>>>>>>>>> >>> >>>>>>>>>> Regards >>> >>>>>>>>>> JB >>> >>>>>>>>>> >>> >>>>>>>>>> On Thu, Oct 10, 2024 at 11:41 PM rdb...@gmail.com < >>> rdb...@gmail.com> wrote: >>> >>>>>>>>>> > >>> >>>>>>>>>> > Hi everyone, >>> >>>>>>>>>> > >>> >>>>>>>>>> > There seems to be broad agreement around Anton's proposal >>> to use deletion vectors in Iceberg v3, so I've opened two PRs that update >>> the spec with the proposed changes. The first, PR #11238, adds a new Puffin >>> blob type, delete-vector-v1, that stores a delete vector. The second, PR >>> #11240, updates the Iceberg table spec. >>> >>>>>>>>>> > >>> >>>>>>>>>> > Please take a look and comment! >>> >>>>>>>>>> > >>> >>>>>>>>>> > Ryan >>
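To make Russell's point concrete -- that on the Iceberg side a reader only needs the bitmap and the offset at which it sits in a file -- here is a minimal sketch of the read path, assuming the Puffin footer (or converted Iceberg metadata) supplies the blob's offset and length. The class and method names are illustrative only.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SeekableByteChannel;

    // Sketch only: seek straight to the serialized bitmap using the
    // offset/length from metadata, ignoring the Delta framing entirely.
    class DvOffsetReadSketch {
      static ByteBuffer readBitmapBytes(SeekableByteChannel channel, long offset, int length)
          throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        channel.position(offset);              // jump directly to the bitmap bytes
        while (buf.hasRemaining() && channel.read(buf) >= 0) {
          // keep reading until the whole bitmap is buffered
        }
        buf.flip();
        return buf;                            // deserialize as a portable 64-bit RoaringBitmap
      }
    }

This is why the extra Delta fields cost Iceberg readers nothing at read time: the offset can point past the length and magic bytes, directly at the bitmap.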