For the conversion from Delta to Iceberg, wouldn't we need to scan all of the Delta vectors if we chose a different CRC or other endianness? Does Delta mandate that writers also include this information in their metadata files?
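For concreteness, here is a minimal sketch (in Java) of the framing that, per the discussion below, existing Delta readers expect around a deletion vector: a 4-byte big-endian length, magic bytes, the serialized bitmap, and a big-endian CRC-32. The MAGIC constant is a placeholder rather than the actual spec value, and the exact byte ranges covered by the length and checksum are assumptions here; the authoritative layout is in the spec PRs.

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.CRC32;

    // Sketch only: the blob framing an existing Delta-style reader expects.
    class DvBlobSketch {
      private static final int MAGIC = 0x00000000; // placeholder, not the spec value

      static byte[] readBitmap(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();           // 4 bytes, big-endian; Iceberg itself doesn't need this
        if (data.readInt() != MAGIC) {         // magic bytes identify the blob without the Puffin footer
          throw new IOException("not a deletion vector blob");
        }
        byte[] bitmap = new byte[length - 4];  // assumes the length covers magic + bitmap
        data.readFully(bitmap);
        long storedCrc = Integer.toUnsignedLong(data.readInt()); // 4 bytes, big-endian

        CRC32 crc = new CRC32();
        crc.update(bitmap);                    // per this thread, the CRC validates the bitmap contents
        if (crc.getValue() != storedCrc) {     // a different checksum or endianness choice
          throw new IOException("CRC mismatch"); // would make every existing reader fail here
        }
        return bitmap;                         // portable 64-bit RoaringBitmap bytes (little-endian)
      }
    }

If Iceberg writes this same framing, the check above is exactly what keeps old Delta readers working, and a Delta-to-Iceberg conversion can stay metadata-only.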
On Thu, Oct 17, 2024 at 4:26 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote: > We would want to have magic bytes + checksum as part of the blob in > Iceberg, as discussed in the spec PRs. If we chose something other than CRC > and/or used little endian for all parts of the blob, this would break the > compatibility in either direction and would prevent the use case that Scott > was mentioning. > > - Anton > > Thu, 17 Oct 2024 at 08:58 Bart Samwel <b...@databricks.com.invalid> > wrote: > >> I hope it's OK if I chime in. I'm one of the people responsible for the >> format for position deletes that is used in Delta Lake and I've been >> reading along with the discussion. Given that the main sticking point is >> whether this compatibility is worth the associated "not pure" spec, I >> figured that maybe I could mention what the consequences would be for the >> Delta Lake developers and users, depending on the outcome of this >> discussion. I can also give some historical background, in case people find >> that interesting. >> >> (1) Historical background on why the Delta Lake format is the way it is. >> >> The reason this length field was added on the Delta Lake side is >> that we didn't have a framing format like Puffin. Like you, we wanted >> the Deletion Vector files to be parseable by themselves, if only for >> debugging purposes. If we could go back, then we might have adopted Puffin. >> Or we would have made the pointers in the metadata point at only the blob + >> CRC, and kept the length outside of it, in the framing format. But the >> reality is that right now there are many clients out there that read the >> current format, and we can't change this anymore. :( The endianness >> difference is simply an unfortunate historical accident. They are at >> different layers, and this was the first time we really did anything >> binary-ish in Delta Lake, so we didn't actually have any consistent >> baseline to be consistent with. We only noticed the difference once it had >> "escaped" into the wild, and then it was too late. >> >> Am I super happy with it? No. Is it *terrible*? Well, not terrible >> enough for us to go back and upgrade the protocol to fix it. It doesn't >> lead to broken behavior. This is just a historical idiosyncrasy, and the >> friction caused by protocol changes is much higher than any benefit from a >> cleaner spec. So basically, we're stuck with it until the next time we do a >> major overhaul of the protocol. >> >> (2) What are the consequences for Delta Lake if this is *not* made >> compatible? >> >> Well, then we'd have to support this new layout in Delta Lake. This would >> be a long and relatively painful process. >> >> It would not just be a matter of "retconning" it into the protocol and >> updating the libraries. There are simply too many connectors out there, >> owned by different vendors, etc. Until they adopted the change, they >> would simply error out on these files *at runtime with weird errors*, or >> potentially even use the invalid values and crash and burn. (Lack of proper >> input validation is unfortunately a real thing in the wild.) >> >> So instead, what we would do is add this in a new protocol version of >> Delta Lake. Or actually, it would be a "table feature", since Delta Lake >> has à la carte protocol features. But these features tend to take a long >> time to fully permeate the connector ecosystem, and people don't actually >> upgrade their systems very quickly. 
That means that realistically, nobody >> would be able to make use of this for quite a while. >> >> So what would need to happen instead? For now we would have to rewrite >> the delete files on conversion, only to add this annoying little length >> field. This would add at least 200 ms of latency to any metadata >> conversion, if only because of the cloud object storage GET and PUT >> latency. Furthermore, the conversion latency for a single commit would >> become dependent on the number of delete files instead of being O(1). And >> it would take significant development time to actually make this work and >> to make this scale. >> >> Based on these consequences, you can imagine why I would *really* >> appreciate it if the community could weigh this aspect as part of their >> deliberations. >> >> (3) Is Iceberg -> Delta Lake compatibility actually important enough to >> care about? >> >> From where I'm standing, compatibility is nearly always very important. >> It's not important for users who have standardized fully on Iceberg, and >> those are probably the most represented here in the dev community. But in >> the world that I'm seeing, companies are generally using a mixture of many >> different systems, and they are suffering because of the inability of >> systems to operate efficiently on each other's data. Being able to convert >> easily and efficiently in both directions benefits users. In this case it's >> about Iceberg and Delta Lake, but IMO this is true as a principle >> regardless of which systems you're talking about -- lower friction for >> interoperability is extremely valuable because it increases users' choice in >> the tools that they can use -- it allows them to choose the right tool for >> the job at hand. And it doesn't matter if users are converting from Delta >> Lake to Iceberg or the other way around; they are in fact all Iceberg users! >> >> Putting it simply: I have heard many users complain that they can't >> (efficiently) read data from system X in system Y. At the same time, I have >> never heard a user complaining about having inconsistent endianness in >> their protocols. >> >> On Thu, Oct 17, 2024 at 11:02 AM Jean-Baptiste Onofré <j...@nanthrax.net> >> wrote: >> >>> Hi folks, >>> >>> As Daniel said, I think we actually have two proposals in one: >>> 1. The first proposal is "improvement of positional delete files", >>> using delete vectors stored in Puffin files. I like this proposal; it >>> makes a lot of sense. I think we have a kind of consensus here (we >>> discussed how to parse Puffin files, etc. -- a good discussion). >>> 2. The second, based on (1), is supporting a vector format "compatible" with >>> Delta. This is also interesting. However, do we really need this in >>> Spec V3? Why not focus on the original proposal (improvement of >>> positional deletes) with a simple approach, and evaluate Delta >>> compatibility later? If the compatibility is "easy", I'm not against >>> including it in V3, but users might be disappointed if bringing this in >>> means a tradeoff. >>> >>> IMHO, I would focus on (1) because it would be a great feature for the >>> Iceberg community. >>> >>> Regards >>> JB >>> >>> On Wed, Oct 16, 2024 at 9:16 PM Daniel Weeks <dwe...@apache.org> wrote: >>> > >>> > Hey Everyone, >>> > >>> > I feel like at this point we've articulated all of the various options >>> and paths forward, but this really just comes down to a matter of whether >>> we want to make a concession here for the purpose of compatibility. 
>>> > >>> > If we were building this with no prior art, I would expect to omit the >>> length and align the endianness, but given there's an opportunity to close >>> the gap with minor inefficiency, it merits real consideration. >>> > >>> > This proposal takes into consideration bi-directional compatibility >>> while maintaining backward compatibility. Do we feel this is beneficial to >>> the larger community, or should we abandon the efforts for compatibility? >>> > >>> > -Dan >>> > >>> > On Wed, Oct 16, 2024 at 11:01 AM rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >> >>> >> Thanks, Russell, for the clear summary of the pros and cons! I agree >>> there's some risk to Iceberg implementations, but I think that is mitigated >>> somewhat by code reuse. For example, an engine like Trino could simply >>> reuse code for reading Delta bitmaps, so we would get some validation and >>> support more easily. >>> >> >>> >> Micah, I agree with the requirements that you listed, but I would say >>> #2 is not yet a "requirement" for the design. It's a consideration that I >>> think has real value, but it's up to the community whether we want to add >>> some cost to #1 to make #2 happen. I definitely think that #3 is a >>> requirement so that we can convert Delta to Iceberg metadata (as in the >>> iceberg-delta-lake module). >>> >> >>> >> For the set of options, I would collapse a few of them >>> because I think that we would use the same bitmap representation, the >>> portable 64-bit roaring bitmap. >>> >> >>> >> If that's the case (and probably even if we had some other >>> representation), then Delta can always add support for reading Iceberg >>> delete vectors. That means we either go with the current proposal (a), which >>> preserves the ability of existing Delta clients to read, or we go with a >>> different proposal that we think is better, in which case Delta adds >>> support. >>> >> >>> >> I think both options (c) and (d) have the same effect: Delta readers >>> need to change and that breaks forward compatibility. Specifically: >>> >> * I think that Option (c) would mean that we set the offset to either >>> the magic bytes or directly to the start of the roaring bitmap, so I think we >>> will almost certainly be able to read Delta DVs. Even if we didn't have a >>> similar bitmap encoding, we would probably end up adding support for >>> reading Delta DVs for iceberg-delta-lake. Then it's a question of whether >>> support for converted files is required -- similar to how we handle missing >>> partition values in data files from Hive tables, which we just updated the >>> spec to clarify. >>> >> * Option (d) is still incompatible with existing Delta readers, so >>> there isn't much of a difference between this and (b) >>> >> >>> >> To me, Micah's requirement #2 is a good goal, but needs to be >>> balanced against the cost. I don't see that cost as too high, and I think >>> avoiding fragmentation across the projects helps us work together more in >>> the future. But again, that may be my goal and not a priority for the >>> broader Iceberg community. >>> >> >>> >> Ryan >>> >> >>> >> On Wed, Oct 16, 2024 at 10:10 AM Micah Kornfield < >>> emkornfi...@gmail.com> wrote: >>> >>> >>> >>> One small point >>> >>>> >>> >>>> Theoretically we could end up with Iceberg implementers who have >>> bugs in this part of the code and we wouldn't even know it was an issue >>> till someone converted the table to Delta. 
>>> >>> >>> >>> >>> >>> I guess we could mandate readers validate all fields here to make >>> sure they are all consistent, even if unused. >>> >>> >>> >>> Separately, I think it might pay to take a step back and restate >>> the desired requirements of this design (in no particular order): >>> >>> 1. The best possible implementation of DVs (limited redundancy, no >>> extraneous fields, CPU efficiency, minimal space, etc.). >>> >>> 2. The ability for Delta Lake readers to read Iceberg DVs >>> >>> 3. The ability for Iceberg readers to read Delta Lake DVs >>> >>> >>> >>> The current proposal accomplishes 2 and 3 at very low cost, with some >>> cost for 1. I still think 1 is important. Table formats are still >>> going through a very large growth phase, so making suboptimal choices, when >>> there are better choices that don't add substantial cost, shouldn't be done >>> lightly. Granted, DVs are only going to be a very small part of the cost of >>> any table format. >>> >>> >>> >>> I think it is worth discussing other options to see if we think >>> there is a better one (if there isn't, then I would propose we continue with >>> the current proposal). Please chime in if I missed one, but off the top of >>> my head these are: >>> >>> >>> >>> a. Go forward with the current proposal >>> >>> b. Create a different DV format that we feel is better, and take >>> no additional steps for compatibility with Delta Lake. >>> >>> c. Create a different DV format that we feel is better, and allow >>> backwards compatibility by adding "reader" support for Delta Lake DVs in >>> the spec, but not "writer" support. >>> >>> d. Go forward with the current proposal but use offset and length >>> to trim off the "offset" bytes. (I assume this would break Delta Lake >>> readers, but I think Iceberg readers could still read Delta Lake tables. >>> This option is very close to (c) but doesn't address all concerns around the DV >>> format.) >>> >>> >>> >>> Out of these, my slight preference would be option (c) (add >>> migration capabilities from Delta Lake to Iceberg), followed by option (a) >>> (the current proposal). >>> >>> >>> >>> Cheers, >>> >>> Micah >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Tue, Oct 15, 2024 at 9:32 PM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> >>> >>>> @Scott We would have the ability to read Delta vectors regardless >>> of what we pick, since on the Iceberg side we really just need the bitmap and >>> the offset at which it is located within a file; everything else could be in the >>> Iceberg metadata. We don't have any disagreement on this aspect, I think. >>> >>>> >>> >>>> The question is whether we would write additional Delta-specific >>> metadata into the vector itself that an Iceberg implementation would not >>> use, so that current Delta readers could read Iceberg delete vectors without >>> a code change or rewriting the vectors. The underlying representation would >>> still be the same between the two formats. >>> >>>> >>> >>>> The pros of doing this are that a reverse translation of Iceberg to >>> Delta would be much simpler. Any implementers who already have Delta >>> vector read code can probably mostly reuse it, although our metadata would >>> let them skip to just reading the bitmap. >>> >>>> >>> >>>> The cons are that the metadata being written isn't used by Iceberg, >>> so any real tests would require using a Delta reader; anything else would >>> just be synthetic. 
Theoretically we could end up with Iceberg implementers >>> who have bugs in this part of the code and we wouldn't even know it was an >>> issue till someone converted the table to Delta. Other Iceberg readers >>> would just be ignoring these bytes, so we essentially are adding a >>> requirement and complexity (although not that much) to Iceberg writers for >>> the benefit of current Delta readers. Delta would probably also have to add >>> new fields to their metadata representations to capture the vector >>> metadata to handle our bitmaps. >>> >>>> >>> >>>> On Tue, Oct 15, 2024 at 5:56 PM Scott Cowell < >>> scott.cow...@dremio.com.invalid> wrote: >>> >>>>> >>> >>>>> From an engine perspective, I think compatibility between Delta and >>> Iceberg on DVs is a great thing to have. The additions for cross-compat >>> seem a minor thing to me, one that is vastly outweighed by a future where Delta >>> tables with DVs are supported in Delta UniForm and can be read by any >>> Iceberg V3-compliant engine. >>> >>>>> >>> >>>>> -Scott >>> >>>>> >>> >>>>> On Tue, Oct 15, 2024 at 2:06 PM Anton Okolnychyi < >>> aokolnyc...@gmail.com> wrote: >>> >>>>>> >>> >>>>>> Are there engines/vendors/companies in the community that support >>> both Iceberg and Delta and would benefit from having one blob layout for >>> DVs? >>> >>>>>> >>> >>>>>> - Anton >>> >>>>>> >>> >>>>>> Tue, 15 Oct 2024 at 11:10 rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >>>>>>> >>> >>>>>>> Thanks, Szehon. >>> >>>>>>> >>> >>>>>>> To clarify on compatibility, using the same format for the blobs >>> makes it so that existing Delta readers can read and use the DVs written by >>> Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra >>> fields, they would not need to change how readers work. That's why I think >>> there is a benefit to using the same format: we avoid unnecessary >>> fragmentation and make sure data and delete files are compatible. >>> >>>>>>> >>> >>>>>>> Ryan >>> >>>>>>> >>> >>>>>>> On Tue, Oct 15, 2024 at 10:57 AM Szehon Ho < >>> szehon.apa...@gmail.com> wrote: >>> >>>>>>>> >>> >>>>>>>> This is awesome work by Anton and Ryan. It looks like a ton of >>> effort has gone into the V3 position vector proposal to make it clean and >>> efficient -- a long time coming, and I'm truly excited to see the great >>> improvement in storage/perf. >>> >>>>>>>> >>> >>>>>>>> wrt these fields, I think most of the concerns are already >>> mentioned by the other community members in the PRs >>> https://github.com/apache/iceberg/pull/11238 and >>> https://github.com/apache/iceberg/pull/11240, so not much to add. The >>> DV itself is the 64-bit RoaringBitmap format, so that's great; the argument for >>> CRC seems reasonable, and I don't have enough data to be opinionated about >>> the endianness/magic bytes. >>> >>>>>>>> >>> >>>>>>>> But I do lean towards the many PR comments that the extra >>> length field is unnecessary and would just add confusion. It seemed to me >>> that the Iceberg community has made so much effort to trim the spec to the >>> bare minimum for cleanliness and efficiency, so I do feel the field is not >>> in the usual direction of the project. Also, I'm not clear on the plan for >>> old Delta readers -- they can't read Puffin anyway; if Delta adopts Puffin, >>> then new readers could adopt it? Anyway, great work again -- thanks for raising >>> the issue on the dev list! 
>>> >>>>>>>> >>> >>>>>>>> Thanks, >>> >>>>>>>> Szehon >>> >>>>>>>> >>> >>>>>>>> On Mon, Oct 14, 2024 at 5:14 PM rdb...@gmail.com < >>> rdb...@gmail.com> wrote: >>> >>>>>>>>> >>> >>>>>>>>> > I think it might be worth mentioning the current proposal >>> makes some, mostly minor, design choices to try to be compatible with Delta >>> Lake deletion vectors. >>> >>>>>>>>> >>> >>>>>>>>> Yes it does, and thanks for pointing this out, Micah. I think >>> it's important to consider whether compatibility is important to this >>> community. I just replied to Piotr on the PR, but I'll adapt some of that >>> response here to reach the broader community. >>> >>>>>>>>> >>> >>>>>>>>> I think there is value in supporting compatibility with older >>> Delta readers, but I acknowledge that this may be my perspective because my >>> employer has a lot of Delta customers that we are going to support now and >>> in the future. >>> >>>>>>>>> >>> >>>>>>>>> The main use case for maintaining compatibility with the Delta >>> format is that it's hard to move old jobs to new code in a migration. We >>> see a similar issue in Hive-to-Iceberg migrations, where unknown older >>> readers prevent migration entirely because they are hard to track down and >>> often read files directly from the backing object store. I'd like to avoid >>> the same problem here, where all readers need to be identified and migrated >>> at the same time. Compatibility with the format those readers expect makes >>> it possible to maintain Delta metadata for them temporarily. That increases >>> confidence that things won't randomly break and makes it easier to get >>> people to move forward. >>> >>>>>>>>> >>> >>>>>>>>> The second reason for maintaining compatibility is that we >>> want the formats to become more similar. My hope is that we can work >>> across both communities and come up with a common metadata format in a >>> future version -- which explains my interest in smooth migrations. >>> Maintaining compatibility in cases like this builds trust and keeps our >>> options open: if we have compatible data layers, then it's easier to build >>> a compatible metadata layer. I'm hoping that if we make the blob format >>> compatible, we can get the Delta community to start using Puffin for better >>> self-describing delete files. >>> >>>>>>>>> >>> >>>>>>>>> Other people may not share those goals, so I think it helps to >>> consider what is being compromised for this compatibility. I don't think it >>> is too much. There are two additional fields: >>> >>>>>>>>> * A 4-byte length field (that Iceberg doesn't need) >>> >>>>>>>>> * A 4-byte CRC to validate the contents of the bitmap >>> >>>>>>>>> >>> >>>>>>>>> There are also changes to how these would have been added if >>> the Iceberg community were building this independently. >>> >>>>>>>>> * Our initial version didn't include a CRC at all, but now >>> that we think it's useful, compatibility means using a CRC-32 checksum >>> rather than a newer one >>> >>>>>>>>> * The Delta format uses big endian for its fields (or mixed >>> endian if you consider that the RoaringBitmap is LE) >>> >>>>>>>>> * The magic bytes (added to avoid reading the Puffin footer) >>> would have been different >>> >>>>>>>>> >>> >>>>>>>>> Overall, I don't think that those changes to what we would >>> have done are unreasonable. It's only 8 extra bytes, and half of them are >>> for a checksum, which is a good idea. >>> >>>>>>>>> >>> >>>>>>>>> I'm looking forward to what the rest of the community thinks >>> about this. Thanks for reviewing the PR! 
>>> >>>>>>>>> >>> >>>>>>>>> Ryan >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> On Sun, Oct 13, 2024 at 10:45 PM Jean-Baptiste Onofré < >>> j...@nanthrax.net> wrote: >>> >>>>>>>>>> >>> >>>>>>>>>> Hi >>> >>>>>>>>>> >>> >>>>>>>>>> Thanks for the PRs! I reviewed Anton's document; I will do a >>> pass on the PRs. >>> >>>>>>>>>> >>> >>>>>>>>>> IMHO, it's important to get feedback from query engines: while >>> delete vectors are not a problem per se (they're what we are using as an internal >>> representation), the use of Puffin files to store them is "impactful" >>> for the query engines (some query engines might need to >>> implement the Puffin spec (read/write) in a language other than Java, for >>> instance Apache Impala). >>> >>>>>>>>>> >>> >>>>>>>>>> I like the proposal; I just hope we won't "surprise" some >>> query >>> >>>>>>>>>> engines with extra work :) >>> >>>>>>>>>> >>> >>>>>>>>>> Regards >>> >>>>>>>>>> JB >>> >>>>>>>>>> >>> >>>>>>>>>> On Thu, Oct 10, 2024 at 11:41 PM rdb...@gmail.com < >>> rdb...@gmail.com> wrote: >>> >>>>>>>>>> > >>> >>>>>>>>>> > Hi everyone, >>> >>>>>>>>>> > >>> >>>>>>>>>> > There seems to be broad agreement around Anton's proposal >>> to use deletion vectors in Iceberg v3, so I've opened two PRs that update >>> the spec with the proposed changes. The first, PR #11238, adds a new Puffin >>> blob type, delete-vector-v1, that stores a delete vector. The second, PR >>> #11240, updates the Iceberg table spec. >>> >>>>>>>>>> > >>> >>>>>>>>>> > Please take a look and comment! >>> >>>>>>>>>> > >>> >>>>>>>>>> > Ryan >>
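To make Russell's point concrete -- that on the Iceberg side a reader only needs the bitmap and the offset at which it sits in a file -- here is a minimal sketch of the read path, assuming the Puffin footer (or converted Iceberg metadata) supplies the blob's offset and length. The class and method names are illustrative only.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SeekableByteChannel;

    // Sketch only: seek straight to the serialized bitmap using the
    // offset/length from metadata, ignoring the Delta framing entirely.
    class DvOffsetReadSketch {
      static ByteBuffer readBitmapBytes(SeekableByteChannel channel, long offset, int length)
          throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        channel.position(offset);              // jump directly to the bitmap bytes
        while (buf.hasRemaining() && channel.read(buf) >= 0) {
          // keep reading until the whole bitmap is buffered
        }
        buf.flip();
        return buf;                            // deserialize as a portable 64-bit RoaringBitmap
      }
    }

This is why the extra Delta fields cost Iceberg readers nothing at read time: the offset can point past the length and magic bytes, directly at the bitmap.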