We would want to have magic bytes + checksum as part of the blob in Iceberg, as discussed in the spec PRs. If we chose something other than CRC and/or used little endian for all parts of the blob, this would break compatibility in either direction and would prevent the use case that Scott was mentioning.

- Anton
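For reference, a minimal Java sketch of the blob layout under discussion, as I read it from the spec PRs: a big-endian length, magic bytes, the portable 64-bit roaring bitmap, and a big-endian CRC-32. The magic value and exactly which bytes the length and CRC cover are placeholders/assumptions here, not the authoritative spec.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class DeleteVectorBlobWriteSketch {
  // Placeholder magic value for illustration; the real value is defined in the spec PR.
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // bitmapBytes is assumed to already be the portable serialization of the
  // 64-bit roaring bitmap (which is little-endian internally).
  static byte[] writeBlob(byte[] bitmapBytes) {
    int dataLength = MAGIC.length + bitmapBytes.length;

    CRC32 crc = new CRC32();
    crc.update(MAGIC);
    crc.update(bitmapBytes);

    ByteBuffer buffer = ByteBuffer.allocate(4 + dataLength + 4).order(ByteOrder.BIG_ENDIAN);
    buffer.putInt(dataLength);            // 4-byte big-endian length, kept for Delta readers
    buffer.put(MAGIC);                    // magic bytes, so the blob is identifiable without the Puffin footer
    buffer.put(bitmapBytes);              // portable 64-bit roaring bitmap
    buffer.putInt((int) crc.getValue());  // 4-byte big-endian CRC-32 over magic + bitmap
    return buffer.array();
  }
}
```

In this sketch the length field is the part Iceberg itself would not need (the Puffin footer already describes the blob), while the CRC is the part the thread agrees is worth keeping.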
On Thu, Oct 17, 2024 at 08:58, Bart Samwel <b...@databricks.com.invalid> wrote:

> I hope it's OK if I chime in. I'm one of the people responsible for the format for position deletes that is used in Delta Lake and I've been reading along with the discussion. Given that the main sticking point is whether this compatibility is worth the associated "not pure" spec, I figured that maybe I can mention what the consequences would be for the Delta Lake developers and users, depending on the outcome of this discussion. I can also give some historical background, in case people find that interesting.
>
> (1) Historical background on why the Delta Lake format is the way it is.
>
> The reason that this length field was added on the Delta Lake side is because we didn't have a framing format like Puffin. Like you, we wanted the Deletion Vector files to be parseable by themselves, if only for debugging purposes. If we could go back, then we might have adopted Puffin. Or we would have made the pointers in the metadata point at only the blob + CRC, and kept the length outside of it, in the framing format. But the reality is that right now there are many clients out there that read the current format, and we can't change this anymore. :( The endianness difference is simply an unfortunate historical accident. They are at different layers, and this was the first time we really did anything binary-ish in Delta Lake, so we didn't actually have any consistent baseline to be consistent with. We only noticed the difference once it had "escaped" into the wild, and then it was too late.
>
> Am I super happy with it? No. Is it *terrible*? Well, not terrible enough for us to go back and upgrade the protocol to fix it. It doesn't lead to broken behavior. This is just a historical idiosyncrasy, and the friction caused by protocol changes is much higher than any benefit from a cleaner spec. So basically, we're stuck with it until the next time we do a major overhaul of the protocol.
>
> (2) What are the consequences for Delta Lake if this is *not* made compatible?
>
> Well, then we'd have to support this new layout in Delta Lake. This would be a long and relatively painful process.
>
> It would not just be a matter of "retconning" it into the protocol and updating the libraries. There are simply too many connectors out there, owned by different vendors etc. Until they adopted the change, they would simply error out on these files *at runtime with weird errors*, or potentially even use the invalid values and crash and burn. (Lack of proper input validation is unfortunately a real thing in the wild.)
>
> So instead, what we would do is add this in a new protocol version of Delta Lake. Or actually, it would be a "table feature", since Delta Lake has a-la-carte protocol features. But these features tend to take a long time to fully permeate the connector ecosystem, and people don't actually upgrade their systems very quickly. That means that realistically, nobody would be able to make use of this for quite a while.
>
> So what would need to happen instead? For now we would have to rewrite the delete files on conversion, only to add this annoying little length field. This would add at least 200 ms of latency to any metadata conversion, if only because of the cloud object storage GET and PUT latency. Furthermore, the conversion latency for a single commit would become dependent on the number of delete files instead of being O(1).
> And it would take significant development time to actually make this work and to make it scale.
>
> Based on these consequences, you can imagine why I would *really* appreciate it if the community could weigh this aspect as part of their deliberations.
>
> (3) Is Iceberg -> Delta Lake compatibility actually important enough to care about?
>
> From where I'm standing, compatibility is nearly always very important. It's not important for users who have standardized fully on Iceberg, and those are probably the most represented here in the dev community. But in the world that I'm seeing, companies are generally using a mixture of many different systems, and they are suffering because of the inability of systems to operate efficiently on each other's data. Being able to convert easily and efficiently in both directions benefits users. In this case it's about Iceberg and Delta Lake, but IMO this is true as a principle regardless of which systems you're talking about -- lower friction for interoperability is very high value because it increases users' choice in the tools that they can use -- it allows them to choose the right tool for the job at hand. And it doesn't matter if users are converting from Delta Lake to Iceberg or the other way around, they are in fact all Iceberg users!
>
> Putting it simply: I have heard many users complain that they can't (efficiently) read data from system X in system Y. At the same time, I have never heard a user complain about having inconsistent endianness in their protocols.
>
> On Thu, Oct 17, 2024 at 11:02 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi folks,
>>
>> As Daniel said, I think we actually have two proposals in one:
>> 1. The first proposal is "improvement of positional delete files", using delete vectors stored in Puffin files. I like this proposal, it makes a lot of sense. I think we have a kind of consensus here (we discussed how to parse Puffin files, etc.; good discussion).
>> 2. Then, based on (1), is supporting a vector format "compatible" with Delta. This is also interesting. However, do we really need this in spec V3? Why not focus on the original proposal (improvement of positional deletes) with a simple approach, and evaluate Delta compatibility later? If the compatibility is "easy", I'm not against including it in V3, but users might be disappointed if bringing it in means a tradeoff.
>>
>> Imho, I will focus on 1 because it would be a great feature for the Iceberg community.
>>
>> Regards
>> JB
>>
>> On Wed, Oct 16, 2024 at 9:16 PM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > Hey Everyone,
>> >
>> > I feel like at this point we've articulated all of the various options and paths forward, but this really just comes down to a matter of whether we want to make a concession here for the purpose of compatibility.
>> >
>> > If we were building this with no prior art, I would expect to omit the length and align the endianness, but given there's an opportunity to close the gap with minor inefficiency, it merits real consideration.
>> >
>> > This proposal takes into consideration bi-directional compatibility while maintaining backward compatibility. Do we feel this is beneficial to the larger community or should we discard efforts for compatibility?
>> >
>> > -Dan
>> >
>> > On Wed, Oct 16, 2024 at 11:01 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>
>> >> Thanks, Russell, for the clear summary of the pros and cons!
>> >> I agree there's some risk to Iceberg implementations, but I think that is mitigated somewhat by code reuse. For example, an engine like Trino could simply reuse code for reading Delta bitmaps, so we would get some validation and support more easily.
>> >>
>> >> Micah, I agree with the requirements that you listed, but I would say #2 is not yet a "requirement" for the design. It's a consideration that I think has real value, but it's up to the community whether we want to add some cost to #1 to make #2 happen. I definitely think that #3 is a requirement so that we can convert Delta to Iceberg metadata (as in the iceberg-delta-lake module).
>> >>
>> >> For the set of options, I would collapse a few of those options because I think that we would use the same bitmap representation, the portable 64-bit roaring bitmap.
>> >>
>> >> If that's the case (and probably even if we had some other representation), then Delta can always add support for reading Iceberg delete vectors. That means we either go with the current proposal (a) that preserves the ability for existing Delta clients to read, or we go with a different proposal that we think is better, in which case Delta adds support.
>> >>
>> >> I think both options (c) and (d) have the same effect: Delta readers need to change and that breaks forward compatibility. Specifically:
>> >> * I think that option (c) would mean that we set the offset to either the magic bytes or directly to the start of the roaring bitmap, so I think we will almost certainly be able to read Delta DVs. Even if we didn't have a similar bitmap encoding, we would probably end up adding support for reading Delta DVs for iceberg-delta-lake. Then it's a question of whether support for converted files is required -- similar to how we handle missing partition values in data files from Hive tables, which we just updated the spec to clarify.
>> >> * Option (d) is still incompatible with existing Delta readers, so there isn't much of a difference between this and (b).
>> >>
>> >> To me, Micah's requirement #2 is a good goal, but it needs to be balanced against the cost. I don't see that cost as too high, and I think avoiding fragmentation across the projects helps us work together more in the future. But again, that may be my goal and not a priority for the broader Iceberg community.
>> >>
>> >> Ryan
>> >>
>> >> On Wed, Oct 16, 2024 at 10:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>
>> >>> One small point
>> >>>
>> >>>> Theoretically we could end up with iceberg implementers who have bugs in this part of the code and we wouldn't even know it was an issue till someone converted the table to delta.
>> >>>
>> >>> I guess we could mandate that readers validate all fields here to make sure they are all consistent, even if unused.
>> >>>
>> >>> Separately, I think it might pay to take a step back and restate the desired requirements of this design (in no particular order):
>> >>> 1. The best possible implementation of DVs (limited redundancy, no extraneous fields, CPU efficiency, minimal space, etc.).
>> >>> 2. The ability for Delta Lake readers to read Iceberg DVs.
>> >>> 3. The ability for Iceberg readers to read Delta Lake DVs.
>> >>>
>> >>> The current proposal accomplishes 2 and 3 at very low cost, with some cost for 1. I still think 1 is important.
>> >>> Table formats are still going through a very large growth phase, so taking suboptimal choices, when there are better choices that don't add substantial cost, shouldn't be done lightly. Granted, DVs are only going to be a very small part of the cost of any table format.
>> >>>
>> >>> I think it is worth discussing other options to see if we think there is a better one (if there isn't, then I would propose we continue with the current proposal). Please chime in if I missed one, but off the top of my head these are:
>> >>>
>> >>> a. Go forward with the current proposal.
>> >>> b. Create a different DV format that we feel is better, and take no additional steps for compatibility with Delta Lake.
>> >>> c. Create a different DV format that we feel is better, and allow backwards compatibility by adding "reader" support for Delta Lake DVs in the spec, but not "writer" support.
>> >>> d. Go forward with the current proposal but use offset and length to trim off the "offset" bytes. (I assume this would break Delta Lake readers, but I think Iceberg readers could still read Delta Lake tables. This option is very close to (c) but doesn't address all concerns around the DV format.)
>> >>>
>> >>> Out of these, my slight preference would be option c (add migration capabilities from Delta Lake to Iceberg), followed by option a (the current proposal).
>> >>>
>> >>> Cheers,
>> >>> Micah
>> >>>
>> >>> On Tue, Oct 15, 2024 at 9:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>
>> >>>> @Scott We would have the ability to read Delta vectors regardless of what we pick, since on the Iceberg side we really just need the bitmap and what offset it is located at within a file; everything else could be in the Iceberg metadata. We don't have any disagreement on this aspect, I think.
>> >>>>
>> >>>> The question is whether we would write additional Delta-specific metadata into the vector itself that an Iceberg implementation would not use, so that current Delta readers could read Iceberg delete vectors without a code change or rewriting the vectors. The underlying representation would still be the same between the two formats.
>> >>>>
>> >>>> The pros to doing this are that a reverse translation of Iceberg to Delta would be much simpler. Any implementers who already have Delta vector read code can probably mostly reuse it, although our metadata would let them skip to just reading the bitmap.
>> >>>>
>> >>>> The cons are that the metadata being written isn't used by Iceberg, so any real tests would require using a Delta reader; anything else would just be synthetic. Theoretically we could end up with Iceberg implementers who have bugs in this part of the code and we wouldn't even know it was an issue till someone converted the table to Delta. Other Iceberg readers would just be ignoring these bytes, so we are essentially adding a requirement and complexity (although not that much) to Iceberg writers for the benefit of current Delta readers. Delta would probably also have to add new fields to their metadata representations to capture the vector metadata to handle our bitmaps.
>> >>>>
>> >>>> On Tue, Oct 15, 2024 at 5:56 PM Scott Cowell <scott.cow...@dremio.com.invalid> wrote:
>> >>>>>
>> >>>>> From an engine perspective I think compatibility between Delta and Iceberg on DVs is a great thing to have.
>> >>>>> The additions for cross-compat seem a minor thing to me that is vastly outweighed by a future where Delta tables with DVs are supported in Delta Uniform and can be read by any Iceberg V3 compliant engine.
>> >>>>>
>> >>>>> -Scott
>> >>>>>
>> >>>>> On Tue, Oct 15, 2024 at 2:06 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Are there engines/vendors/companies in the community that support both Iceberg and Delta and would benefit from having one blob layout for DVs?
>> >>>>>>
>> >>>>>> - Anton
>> >>>>>>
>> >>>>>> On Tue, Oct 15, 2024 at 11:10, rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Thanks, Szehon.
>> >>>>>>>
>> >>>>>>> To clarify on compatibility, using the same format for the blobs makes it so that existing Delta readers can read and use the DVs written by Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra fields they would not need to change how readers work. That's why I think there is a benefit to using the same format: we avoid fragmentation and make sure data and delete files are compatible. No unnecessary fragmentation.
>> >>>>>>>
>> >>>>>>> Ryan
>> >>>>>>>
>> >>>>>>> On Tue, Oct 15, 2024 at 10:57 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> This is awesome work by Anton and Ryan; it looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient. It's been a long time coming and I'm truly excited to see the great improvement in storage/perf.
>> >>>>>>>>
>> >>>>>>>> With regard to these fields, I think most of the concerns are already mentioned by the other community members in the PRs https://github.com/apache/iceberg/pull/11238 and https://github.com/apache/iceberg/pull/11240, so not much to add. The DV itself is the RoaringBitmap 64-bit format, so that's great; the argument for CRC seems reasonable; and I don't have enough data to be opinionated about the endianness/magic bytes.
>> >>>>>>>>
>> >>>>>>>> But I do lean towards the many PR comments that the extra length field is unnecessary and would just add confusion. It seems to me that the Iceberg community has made so much effort to trim the spec to the bare minimum for cleanliness and efficiency, so I do feel the field is not in the normal direction of the project. Also, I'm not clear on the plan for old Delta readers: they can't read Puffin anyway, and if Delta adopts Puffin, then new readers could adopt? Anyway, great work again, thanks for raising the issue on the dev list!
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Szehon
>> >>>>>>>>
>> >>>>>>>> On Mon, Oct 14, 2024 at 5:14 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> > I think it might be worth mentioning the current proposal makes some, mostly minor, design choices to try to be compatible with Delta Lake deletion vectors.
>> >>>>>>>>>
>> >>>>>>>>> Yes it does, and thanks for pointing this out, Micah. I think it's important to consider whether compatibility is important to this community. I just replied to Piotr on the PR, but I'll adapt some of that response here to reach the broader community.
>> >>>>>>>>>
>> >>>>>>>>> I think there is value in supporting compatibility with older Delta readers, but I acknowledge that this may be my perspective because my employer has a lot of Delta customers that we are going to support now and in the future.
>> >>>>>>>>> The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code in a migration. We see a similar issue in Hive to Iceberg migrations, where unknown older readers prevent migration entirely because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Compatibility with the format those readers expect makes it possible to maintain Delta metadata for them temporarily. That increases confidence that things won't randomly break and makes it easier to get people to move forward.
>> >>>>>>>>>
>> >>>>>>>>> The second reason for maintaining compatibility is that we want the formats to become more similar. My hope is that we can work across both communities and come up with a common metadata format in a future version -- which explains my interest in smooth migrations. Maintaining compatibility in cases like this builds trust and keeps our options open: if we have compatible data layers, then it's easier to build a compatible metadata layer. I'm hoping that if we make the blob format compatible, we can get the Delta community to start using Puffin for better self-describing delete files.
>> >>>>>>>>>
>> >>>>>>>>> Other people may not share those goals, so I think it helps to consider what is being compromised for this compatibility. I don't think it is too much. There are 2 additional fields:
>> >>>>>>>>> * A 4-byte length field (that Iceberg doesn't need)
>> >>>>>>>>> * A 4-byte CRC to validate the contents of the bitmap
>> >>>>>>>>>
>> >>>>>>>>> There are also changes to how these would have been added if the Iceberg community were building this independently:
>> >>>>>>>>> * Our initial version didn't include a CRC at all, but now that we think it's useful, compatibility means using a CRC-32 checksum rather than a newer one
>> >>>>>>>>> * The Delta format uses big endian for its fields (or mixed endian if you consider that RoaringBitmap is LE)
>> >>>>>>>>> * The magic bytes (added to avoid reading the Puffin footer) would have been different
>> >>>>>>>>>
>> >>>>>>>>> Overall, I don't think that those changes to what we would have done are unreasonable. It's only 8 extra bytes, and half of them are for a checksum that is a good idea.
>> >>>>>>>>>
>> >>>>>>>>> I'm looking forward to what the rest of the community thinks about this. Thanks for reviewing the PR!
>> >>>>>>>>>
>> >>>>>>>>> Ryan
>> >>>>>>>>>
>> >>>>>>>>> On Sun, Oct 13, 2024 at 10:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for the PRs! I reviewed Anton's document; I will do a pass on the PRs.
>> >>>>>>>>>>
>> >>>>>>>>>> Imho, it's important to get feedback from query engines because, while delete vectors are not a problem per se (they're what we are using as the internal representation), the use of Puffin files to store them is "impactful" for the query engines (some query engines will probably need to implement the Puffin spec (read/write) in a language other than Java, for instance Apache Impala).
>> >>>>>>>>>> I like the proposal, I just hope we won't "surprise" some query engines with extra work :)
>> >>>>>>>>>>
>> >>>>>>>>>> Regards
>> >>>>>>>>>> JB
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Oct 10, 2024 at 11:41 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> > Hi everyone,
>> >>>>>>>>>> >
>> >>>>>>>>>> > There seems to be broad agreement around Anton's proposal to use deletion vectors in Iceberg v3, so I've opened two PRs that update the spec with the proposed changes. The first, PR #11238, adds a new Puffin blob type, delete-vector-v1, that stores a delete vector. The second, PR #11240, updates the Iceberg table spec.
>> >>>>>>>>>> >
>> >>>>>>>>>> > Please take a look and comment!
>> >>>>>>>>>> >
>> >>>>>>>>>> > Ryan
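To make the fields discussed in the thread concrete (the 4-byte length, magic bytes, portable 64-bit roaring bitmap, and CRC-32 that Ryan enumerates), here is a hedged read-side sketch in Java. It assumes the same layout as the writer sketch earlier, that Iceberg metadata supplies the blob offset (as Russell notes, the offset and bitmap are all Iceberg strictly needs), and it validates the checksum along the lines of Micah's suggestion that readers could be required to check all fields. The field coverage of the length and CRC is an assumption, not the authoritative spec.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class DeleteVectorBlobReadSketch {
  // puffinFileBytes holds the Puffin file contents; offset points at the blob and is
  // assumed to come from Iceberg metadata (which could also provide the length).
  static byte[] readBitmap(byte[] puffinFileBytes, int offset) {
    ByteBuffer buf = ByteBuffer.wrap(puffinFileBytes).order(ByteOrder.BIG_ENDIAN);
    buf.position(offset);

    int dataLength = buf.getInt();        // length of magic + bitmap; redundant for Iceberg, used by Delta readers
    byte[] magic = new byte[4];
    buf.get(magic);                       // magic bytes identify the blob without the Puffin footer
    byte[] bitmapBytes = new byte[dataLength - magic.length];
    buf.get(bitmapBytes);                 // portable 64-bit roaring bitmap (little-endian internally)
    long storedCrc = Integer.toUnsignedLong(buf.getInt());

    CRC32 crc = new CRC32();              // validate the trailing CRC-32 over magic + bitmap
    crc.update(magic);
    crc.update(bitmapBytes);
    if (crc.getValue() != storedCrc) {
      throw new IllegalStateException("CRC-32 mismatch for delete vector blob");
    }
    return bitmapBytes;
  }
}
```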