> For the conversion from Delta to Iceberg, wouldn't we need to scan all of the Delta Vectors if we choose a different CRC or other endian-ness?
Exactly, we would not be able to expose Delta as Iceberg if we choose a different checksum type or byte order.

> Does Delta mandate that writers also include this information in their metadata files?

If I understand correctly, the checksum is only in the DV file, not in the metadata.

- Anton

On Thu, Oct 17, 2024 at 2:51 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

For the conversion from Delta to Iceberg, wouldn't we need to scan all of the Delta Vectors if we choose a different CRC or other endian-ness? Does Delta mandate that writers also include this information in their metadata files?

On Thu, Oct 17, 2024 at 4:26 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

We would want to have magic bytes + checksum as part of the blob in Iceberg, as discussed in the spec PRs. If we chose something other than CRC and/or used little endian for all parts of the blob, this would break compatibility in either direction and would prevent the use case that Scott was mentioning.

- Anton

On Thu, Oct 17, 2024 at 8:58 AM Bart Samwel <b...@databricks.com.invalid> wrote:

I hope it's OK if I chime in. I'm one of the people responsible for the format for position deletes that is used in Delta Lake, and I've been reading along with the discussion. Given that the main sticking point is whether this compatibility is worth the associated "not pure" spec, I figured that maybe I can mention what the consequences would be for the Delta Lake developers and users, depending on the outcome of this discussion. I can also give some historical background, in case people find that interesting.

(1) Historical background on why the Delta Lake format is the way it is.

The reason that this length field was added on the Delta Lake side is that we didn't have a framing format like Puffin. Like you, we wanted the Deletion Vector files to be parseable by themselves, if only for debugging purposes. If we could go back, then we might have adopted Puffin. Or we would have made the pointers in the metadata point at only the blob + CRC, and kept the length outside of it, in the framing format. But the reality is that right now there are many clients out there that read the current format, and we can't change this anymore. :( The endianness difference is simply an unfortunate historical accident. They are at different layers, and this was the first time we really did anything binary-ish in Delta Lake, so we didn't actually have any consistent baseline to be consistent with. We only noticed the difference once it had "escaped" into the wild, and then it was too late.

Am I super happy with it? No. Is it *terrible*? Well, not terrible enough for us to go back and upgrade the protocol to fix it. It doesn't lead to broken behavior. This is just a historical idiosyncrasy, and the friction caused by protocol changes is much higher than any benefit from a cleaner spec. So basically, we're stuck with it until the next time we do a major overhaul of the protocol.
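To make that framing concrete, here is a rough write-side sketch of the layout as discussed in this thread (4-byte big-endian length, magic bytes, portable 64-bit RoaringBitmap, 4-byte big-endian CRC-32). It assumes the RoaringBitmap Java library's portable 64-bit serialization; the magic value shown and the class/method names are illustrative only, and the spec PRs remain the authoritative reference:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

public class DvFramingSketch {
  // Illustrative magic value from the spec discussion; verify against the final spec.
  private static final byte[] MAGIC = {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  static byte[] frame(Roaring64NavigableMap deletes) throws IOException {
    // Portable 64-bit roaring serialization (internally little-endian).
    ByteArrayOutputStream bitmap = new ByteArrayOutputStream();
    deletes.serializePortable(new DataOutputStream(bitmap));

    ByteArrayOutputStream blob = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(blob); // DataOutput writes big-endian

    // 4-byte BE length of magic + bitmap: the field Iceberg itself does not need.
    out.writeInt(MAGIC.length + bitmap.size());
    out.write(MAGIC);
    bitmap.writeTo(out);

    // 4-byte BE CRC-32 over magic + bitmap (confirm exact coverage against the spec).
    CRC32 crc = new CRC32();
    crc.update(MAGIC);
    crc.update(bitmap.toByteArray());
    out.writeInt((int) crc.getValue());
    return blob.toByteArray();
  }
}
```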
(2) What are the consequences for Delta Lake if this is *not* made compatible?

Well, then we'd have to support this new layout in Delta Lake. This would be a long and relatively painful process.

It would not just be a matter of "retconning" it into the protocol and updating the libraries. There are simply too many connectors out there, owned by different vendors, etc. Until they adopted the change, they would simply error out on these files *at runtime with weird errors*, or potentially even use the invalid values and crash and burn. (Lack of proper input validation is unfortunately a real thing in the wild.)

So instead, what we would do is add this in a new protocol version of Delta Lake. Or actually, it would be a "table feature", since Delta Lake has a-la-carte protocol features. But these features tend to take a long time to fully permeate the connector ecosystem, and people don't actually upgrade their systems very quickly. That means that, realistically, nobody would be able to make use of this for quite a while.

So what would need to happen instead? For now we would have to rewrite the delete files on conversion, only to add this annoying little length field. This would add at least 200 ms of latency to any metadata conversion, if only because of the cloud object storage GET and PUT latency. Furthermore, the conversion latency for a single commit would become dependent on the number of delete files instead of being O(1). And it would take significant development time to actually make this work and to make it scale.

Based on these consequences, you can imagine why I would *really* appreciate it if the community could weigh this aspect as part of their deliberations.

(3) Is Iceberg -> Delta Lake compatibility actually important enough to care about?

From where I'm standing, compatibility is nearly always very important. It's not important for users who have standardized fully on Iceberg, and those are probably the most represented here in the dev community. But in the world that I'm seeing, companies are generally using a mixture of many different systems, and they are suffering because of the inability of systems to operate efficiently on each other's data. Being able to convert easily and efficiently in both directions benefits users. In this case it's about Iceberg and Delta Lake, but IMO this is true as a principle regardless of which systems you're talking about: lower friction for interoperability is very high value because it increases users' choice in the tools that they can use; it allows them to choose the right tool for the job at hand. And it doesn't matter if users are converting from Delta Lake to Iceberg or the other way around; they are in fact all Iceberg users!

Putting it simply: I have heard many users complain that they can't (efficiently) read data from system X in system Y. At the same time, I have never heard a user complain about having inconsistent endianness in their protocols.

On Thu, Oct 17, 2024 at 11:02 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi folks,

As Daniel said, I think we actually have two proposals in one:

1. The first proposal is "improvement of positional delete files", using delete vectors stored in Puffin files. I like this proposal; it makes a lot of sense. I think we have a kind of consensus here (we discussed how to parse Puffin files, etc.; good discussion).
2. Then, based on (1), support for a vector format "compatible" with Delta. This is also interesting. However, do we really need this in spec V3?
Why not focus on the original proposal (improvement of positional deletes) with a simple approach, and evaluate Delta compatibility later? If the compatibility is "easy", I'm not against including it in V3, but users might be disappointed if bringing this in means a tradeoff.

Imho, I would focus on 1 because it would be a great feature for the Iceberg community.

Regards
JB

On Wed, Oct 16, 2024 at 9:16 PM Daniel Weeks <dwe...@apache.org> wrote:

Hey Everyone,

I feel like at this point we've articulated all of the various options and paths forward, but this really just comes down to a matter of whether we want to make a concession here for the purpose of compatibility.

If we were building this with no prior art, I would expect to omit the length and align the endianness, but given there's an opportunity to close the gap with minor inefficiency, it merits real consideration.

This proposal takes into consideration bi-directional compatibility while maintaining backward compatibility. Do we feel this is beneficial to the larger community, or should we discard efforts for compatibility?

-Dan

On Wed, Oct 16, 2024 at 11:01 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

Thanks, Russell, for the clear summary of the pros and cons! I agree there's some risk to Iceberg implementations, but I think that is mitigated somewhat by code reuse. For example, an engine like Trino could simply reuse code for reading Delta bitmaps, so we would get some validation and support more easily.

Micah, I agree with the requirements that you listed, but I would say #2 is not yet a "requirement" for the design. It's a consideration that I think has real value, but it's up to the community whether we want to add some cost to #1 to make #2 happen. I definitely think that #3 is a requirement so that we can convert Delta to Iceberg metadata (as in the iceberg-delta-lake module).

For the set of options, I would collapse a few of them because I think that we would use the same bitmap representation, the portable 64-bit roaring bitmap.

If that's the case (and probably even if we had some other representation), then Delta can always add support for reading Iceberg delete vectors. That means we either go with the current proposal (a), which preserves the ability for existing Delta clients to read, or we go with a different proposal that we think is better, in which case Delta adds support.

I think both options (c) and (d) have the same effect: Delta readers need to change, and that breaks forward compatibility. Specifically:

* I think that option (c) would mean that we set the offset to either the magic bytes or directly to the start of the roaring bitmap, so I think we will almost certainly be able to read Delta DVs. Even if we didn't have a similar bitmap encoding, we would probably end up adding support for reading Delta DVs for iceberg-delta-lake. Then it's a question of whether support for converted files is required -- similar to how we handle missing partition values in data files from Hive tables, which we just updated the spec to clarify.
* Option (d) is still incompatible with existing Delta readers, so there isn't much of a difference between this and (b).

To me, Micah's requirement #2 is a good goal, but it needs to be balanced against the cost. I don't see that cost as too high, and I think avoiding fragmentation across the projects helps us work together more in the future. But again, that may be my goal and not a priority for the broader Iceberg community.

Ryan

On Wed, Oct 16, 2024 at 10:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

One small point:

> Theoretically we could end up with iceberg implementers who have bugs in this part of the code and we wouldn't even know it was an issue till someone converted the table to delta.

I guess we could mandate that readers validate all fields here to make sure they are all consistent, even if unused.

Separately, I think it might pay to take a step back and restate the desired requirements of this design (in no particular order):

1. The best possible implementation of DVs (limited redundancy, no extraneous fields, CPU efficiency, minimal space, etc.).
2. The ability for Delta Lake readers to read Iceberg DVs.
3. The ability for Iceberg readers to read Delta Lake DVs.

The current proposal accomplishes 2 and 3 at very low cost, with some cost to 1. I still think 1 is important. Table formats are still going through a very large growth phase, so making suboptimal choices, when there are better choices that don't add substantial cost, shouldn't be done lightly. Granted, DVs are only going to be a very small part of the cost of any table format.

I think it is worth discussing other options to see if we think there is a better one (if there isn't, then I would propose we continue with the current proposal). Please chime in if I missed one, but off the top of my head these are:

a. Go forward with the current proposal.
b. Create a different DV format that we feel is better, and take no additional steps for compatibility with Delta Lake.
c. Create a different DV format that we feel is better, and allow backwards compatibility by adding "reader" support for Delta Lake DVs in the spec, but not "writer" support.
d. Go forward with the current proposal but use offset and length to trim off the "offset" bytes. (I assume this would break Delta Lake readers, but I think Iceberg readers could still read Delta Lake tables.) This option is very close to (c) but doesn't address all concerns around the DV format.

Out of these, my slight preference would be option (c) (add migration capabilities from Delta Lake to Iceberg), followed by option (a) (the current proposal).

Cheers,
Micah

On Tue, Oct 15, 2024 at 9:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

@Scott We would have the ability to read Delta vectors regardless of what we pick, since on the Iceberg side we really just need the bitmap and the offset it is located at within a file; everything else could be in the Iceberg metadata. We don't have any disagreement on this aspect, I think.
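As a rough sketch of that reader-side view (assuming the framing discussed in this thread and the RoaringBitmap Java library's portable 64-bit deserialization; the names are illustrative, not a reference implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

public class DvReaderSketch {
  // All an Iceberg reader needs from table metadata: the file path and blob offset.
  static Roaring64NavigableMap read(String path, long offset) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
      file.seek(offset);
      // 4-byte BE length of magic + bitmap; Iceberg metadata already knows the size.
      int dataLength = file.readInt();
      byte[] data = new byte[dataLength];
      file.readFully(data);
      // Skip the 4 magic bytes; a careful reader would validate them
      // and the trailing CRC-32 as well.
      DataInputStream in = new DataInputStream(
          new ByteArrayInputStream(data, 4, data.length - 4));
      Roaring64NavigableMap bitmap = new Roaring64NavigableMap();
      bitmap.deserializePortable(in);
      return bitmap;
    }
  }
}
```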
The question is whether we would write additional Delta-specific metadata into the vector itself that an Iceberg implementation would not use, so that current Delta readers could read Iceberg delete vectors without a code change or rewriting the vectors. The underlying representation would still be the same between the two formats.

The pros of doing this are that a reverse translation of Iceberg to Delta would be much simpler. Any implementers who already have Delta vector read code can probably mostly reuse it, although our metadata would let them skip to just reading the bitmap.

The cons are that the metadata being written isn't used by Iceberg, so any real tests would require using a Delta reader; anything else would just be synthetic. Theoretically we could end up with Iceberg implementers who have bugs in this part of the code, and we wouldn't even know it was an issue till someone converted the table to Delta. Other Iceberg readers would just be ignoring these bytes, so we are essentially adding a requirement and complexity (although not that much) to Iceberg writers for the benefit of current Delta readers. Delta would probably also have to add new fields to their metadata representations to capture the vector metadata to handle our bitmaps.

On Tue, Oct 15, 2024 at 5:56 PM Scott Cowell <scott.cow...@dremio.com.invalid> wrote:

From an engine perspective, I think compatibility between Delta and Iceberg on DVs is a great thing to have. The additions for cross-compat seem a minor thing to me that is vastly outweighed by a future where Delta tables with DVs are supported in Delta Uniform and can be read by any Iceberg V3 compliant engine.

-Scott

On Tue, Oct 15, 2024 at 2:06 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

Are there engines/vendors/companies in the community that support both Iceberg and Delta and would benefit from having one blob layout for DVs?

- Anton

On Tue, Oct 15, 2024 at 11:10 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

Thanks, Szehon.

To clarify on compatibility, using the same format for the blobs makes it so that existing Delta readers can read and use the DVs written by Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra fields, they would not need to change how their readers work. That's why I think there is a benefit to using the same format: we avoid unnecessary fragmentation and make sure data and delete files are compatible.

Ryan

On Tue, Oct 15, 2024 at 10:57 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

This is awesome work by Anton and Ryan. It looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient, a long time coming, and I'm truly excited to see the great improvement in storage/perf.

With regard to these fields, I think most of the concerns are already mentioned by the other community members in the PRs https://github.com/apache/iceberg/pull/11238 and https://github.com/apache/iceberg/pull/11240, so not much to add.
The DV itself is the RoaringBitmap 64-bit format, so that's great; the argument for CRC seems reasonable; and I don't have enough data to be opinionated about the endianness/magic bytes.

But I do lean towards the many PR comments that the extra length field is unnecessary and would just add confusion. It seemed to me that the Iceberg community has made so much effort to trim the spec to the bare minimum for cleanliness and efficiency, so I do feel the field is not in the normal direction of the project. Also, I'm not clear on the plan for old Delta readers: they can't read Puffin anyway, so if Delta adopts Puffin, then new readers could adopt it? Anyway, great work again, and thanks for raising the issue on the dev list!

Thanks,
Szehon

On Mon, Oct 14, 2024 at 5:14 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> I think it might be worth mentioning the current proposal makes some, mostly minor, design choices to try to be compatible with Delta Lake deletion vectors.

Yes it does, and thanks for pointing this out, Micah. I think it's important to consider whether compatibility is important to this community. I just replied to Piotr on the PR, but I'll adapt some of that response here to reach the broader community.

I think there is value in supporting compatibility with older Delta readers, but I acknowledge that this may be my perspective because my employer has a lot of Delta customers that we are going to support now and in the future.

The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code in a migration. We see a similar issue in Hive to Iceberg migrations, where unknown older readers prevent migration entirely because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Compatibility with the format those readers expect makes it possible to maintain Delta metadata for them temporarily. That increases confidence that things won't randomly break and makes it easier to get people to move forward.

The second reason for maintaining compatibility is that we want the formats to become more similar. My hope is that we can work across both communities and come up with a common metadata format in a future version -- which explains my interest in smooth migrations. Maintaining compatibility in cases like this builds trust and keeps our options open: if we have compatible data layers, then it's easier to build a compatible metadata layer. I'm hoping that if we make the blob format compatible, we can get the Delta community to start using Puffin for better self-describing delete files.

Other people may not share those goals, so I think it helps to consider what is being compromised for this compatibility. I don't think it is too much. There are 2 additional fields:

* A 4-byte length field (that Iceberg doesn't need)
* A 4-byte CRC to validate the contents of the bitmap

There are also changes to how these would have been added if the Iceberg community were building this independently:

* Our initial version didn't include a CRC at all, but now that we think it's useful, compatibility means using a CRC-32 checksum rather than a newer one
* The Delta format uses big endian for its fields (or mixed endianness, if you consider that RoaringBitmap is LE)
* The magic bytes (added to avoid reading the Puffin footer) would have been different
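For illustration, validating that trailing checksum is a few lines with the JDK's standard CRC-32 (assuming, as discussed in this thread, that it covers the magic bytes plus the serialized bitmap and is stored big-endian; the exact coverage should be confirmed against the spec PRs):

```java
import java.util.zip.CRC32;

// Sketch: `data` is the magic bytes followed by the serialized bitmap;
// `storedCrc` is the 4-byte big-endian field read after it.
static void validateCrc(byte[] data, int storedCrc) {
  CRC32 crc = new CRC32();
  crc.update(data);
  if ((int) crc.getValue() != storedCrc) {
    throw new IllegalStateException("Deletion vector CRC-32 mismatch");
  }
}
```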
Overall, I don't think that those changes to what we would have done are unreasonable. It's only 8 extra bytes, and half of them are for a checksum that is a good idea.

I'm looking forward to what the rest of the community thinks about this. Thanks for reviewing the PR!

Ryan

On Sun, Oct 13, 2024 at 10:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,

Thanks for the PRs! I reviewed Anton's document; I will do a pass on the PRs.

Imho, it's important to get feedback from query engines because, while delete vectors are not a problem per se (they are what we use as an internal representation), the use of Puffin files to store them is "impactful" for the query engines (some query engines will probably need to implement the Puffin spec (read/write) in a language other than Java, for instance Apache Impala).

I like the proposal, I just hope we won't "surprise" some query engines with extra work :)

Regards
JB

On Thu, Oct 10, 2024 at 11:41 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

Hi everyone,

There seems to be broad agreement around Anton's proposal to use deletion vectors in Iceberg v3, so I've opened two PRs that update the spec with the proposed changes. The first, PR #11238, adds a new Puffin blob type, delete-vector-v1, that stores a delete vector. The second, PR #11240, updates the Iceberg table spec.

Please take a look and comment!

Ryan