We would want to have magic bytes + checksum as part of the blob in Iceberg, as discussed in the spec PRs. If we chose something other than CRC and/or used little endian for all parts of the blob, this would break compatibility in either direction and would prevent the use case that Scott was mentioning.

- Anton
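For reference, a minimal Java sketch of the blob layout under discussion, as I read it from the spec PRs: a big-endian length, magic bytes, the portable 64-bit roaring bitmap, and a big-endian CRC-32. The magic value and exactly which bytes the length and CRC cover are placeholders/assumptions here, not the authoritative spec.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class DeleteVectorBlobWriteSketch {
  // Placeholder magic value for illustration; the real value is defined in the spec PR.
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // bitmapBytes is assumed to already be the portable serialization of the
  // 64-bit roaring bitmap (which is little-endian internally).
  static byte[] writeBlob(byte[] bitmapBytes) {
    int dataLength = MAGIC.length + bitmapBytes.length;

    CRC32 crc = new CRC32();
    crc.update(MAGIC);
    crc.update(bitmapBytes);

    ByteBuffer buffer = ByteBuffer.allocate(4 + dataLength + 4).order(ByteOrder.BIG_ENDIAN);
    buffer.putInt(dataLength);            // 4-byte big-endian length, kept for Delta readers
    buffer.put(MAGIC);                    // magic bytes, so the blob is identifiable without the Puffin footer
    buffer.put(bitmapBytes);              // portable 64-bit roaring bitmap
    buffer.putInt((int) crc.getValue());  // 4-byte big-endian CRC-32 over magic + bitmap
    return buffer.array();
  }
}
```

In this sketch the length field is the part Iceberg itself would not need (the Puffin footer already describes the blob), while the CRC is the part the thread agrees is worth keeping.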
On Thu, Oct 17, 2024 at 08:58, Bart Samwel <b...@databricks.com.invalid> wrote:

> I hope it's OK if I chime in. I'm one of the people responsible for the format for position deletes that is used in Delta Lake and I've been reading along with the discussion. Given that the main sticking point is whether this compatibility is worth the associated "not pure" spec, I figured that maybe I can mention what the consequences would be for the Delta Lake developers and users, depending on the outcome of this discussion. I can also give some historical background, in case people find that interesting.
>
> (1) Historical background on why the Delta Lake format is the way it is.
>
> The reason that this length field was added on the Delta Lake side is because we didn't have a framing format like Puffin. Like you, we wanted the Deletion Vector files to be parseable by themselves, if only for debugging purposes. If we could go back, then we might have adopted Puffin. Or we would have made the pointers in the metadata point at only the blob + CRC, and kept the length outside of it, in the framing format. But the reality is that right now there are many clients out there that read the current format, and we can't change this anymore. :( The endianness difference is simply an unfortunate historical accident. They are at different layers, and this was the first time we really did anything binary-ish in Delta Lake, so we didn't actually have any consistent baseline to be consistent with. We only noticed the difference once it had "escaped" into the wild, and then it was too late.
>
> Am I super happy with it? No. Is it *terrible*? Well, not terrible enough for us to go back and upgrade the protocol to fix it. It doesn't lead to broken behavior. This is just a historical idiosyncrasy, and the friction caused by protocol changes is much higher than any benefit from a cleaner spec. So basically, we're stuck with it until the next time we do a major overhaul of the protocol.
>
> (2) What are the consequences for Delta Lake if this is *not* made compatible?
>
> Well, then we'd have to support this new layout in Delta Lake. This would be a long and relatively painful process.
>
> It would not just be a matter of "retconning" it into the protocol and updating the libraries. There are simply too many connectors out there, owned by different vendors etc. Until they adopted the change, they would simply error out on these files *at runtime with weird errors*, or potentially even use the invalid values and crash and burn. (Lack of proper input validation is unfortunately a real thing in the wild.)
>
> So instead, what we would do is add this in a new protocol version of Delta Lake. Or actually, it would be a "table feature", since Delta Lake has a-la-carte protocol features. But these features tend to take a long time to fully permeate the connector ecosystem, and people don't actually upgrade their systems very quickly. That means that realistically, nobody would be able to make use of this for quite a while.
>
> So what would need to happen instead? For now we would have to rewrite the delete files on conversion, only to add this annoying little length field. This would add at least 200 ms of latency to any metadata conversion, if only because of the cloud object storage GET and PUT latency. Furthermore, the conversion latency for a single commit would become dependent on the number of delete files instead of being O(1).
> And it would take significant development time to actually make this work and to make it scale.
>
> Based on these consequences, you can imagine why I would *really* appreciate it if the community could weigh this aspect as part of their deliberations.
>
> (3) Is Iceberg -> Delta Lake compatibility actually important enough to care about?
>
> From where I'm standing, compatibility is nearly always very important. It's not important for users who have standardized fully on Iceberg, and those are probably the most represented here in the dev community. But in the world that I'm seeing, companies are generally using a mixture of many different systems, and they are suffering because of the inability of systems to operate efficiently on each other's data. Being able to convert easily and efficiently in both directions benefits users. In this case it's about Iceberg and Delta Lake, but IMO this is true as a principle regardless of which systems you're talking about -- lower friction for interoperability is very high value because it increases users' choice in the tools that they can use -- it allows them to choose the right tool for the job at hand. And it doesn't matter if users are converting from Delta Lake to Iceberg or the other way around, they are in fact all Iceberg users!
>
> Putting it simply: I have heard many users complain that they can't (efficiently) read data from system X in system Y. At the same time, I have never heard a user complain about having inconsistent endianness in their protocols.
>
> On Thu, Oct 17, 2024 at 11:02 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi folks,
>>
>> As Daniel said, I think we actually have two proposals in one:
>> 1. The first proposal is "improvement of positional delete files", using delete vectors stored in Puffin files. I like this proposal, it makes a lot of sense. I think we have a kind of consensus here (we discussed how to parse Puffin files, etc.; good discussion).
>> 2. Then, based on (1), is supporting a vector format "compatible" with Delta. This is also interesting. However, do we really need this in spec V3? Why not focus on the original proposal (improvement of positional deletes) with a simple approach, and evaluate Delta compatibility later? If the compatibility is "easy", I'm not against including it in V3, but users might be disappointed if bringing it in means a tradeoff.
>>
>> Imho, I will focus on 1 because it would be a great feature for the Iceberg community.
>>
>> Regards
>> JB
>>
>> On Wed, Oct 16, 2024 at 9:16 PM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > Hey Everyone,
>> >
>> > I feel like at this point we've articulated all of the various options and paths forward, but this really just comes down to a matter of whether we want to make a concession here for the purpose of compatibility.
>> >
>> > If we were building this with no prior art, I would expect to omit the length and align the endianness, but given there's an opportunity to close the gap with minor inefficiency, it merits real consideration.
>> >
>> > This proposal takes into consideration bi-directional compatibility while maintaining backward compatibility. Do we feel this is beneficial to the larger community or should we discard efforts for compatibility?
>> >
>> > -Dan
>> >
>> > On Wed, Oct 16, 2024 at 11:01 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>
>> >> Thanks, Russell, for the clear summary of the pros and cons!
>> >> I agree there's some risk to Iceberg implementations, but I think that is mitigated somewhat by code reuse. For example, an engine like Trino could simply reuse code for reading Delta bitmaps, so we would get some validation and support more easily.
>> >>
>> >> Micah, I agree with the requirements that you listed, but I would say #2 is not yet a "requirement" for the design. It's a consideration that I think has real value, but it's up to the community whether we want to add some cost to #1 to make #2 happen. I definitely think that #3 is a requirement so that we can convert Delta to Iceberg metadata (as in the iceberg-delta-lake module).
>> >>
>> >> For the set of options, I would collapse a few of those options because I think that we would use the same bitmap representation, the portable 64-bit roaring bitmap.
>> >>
>> >> If that's the case (and probably even if we had some other representation), then Delta can always add support for reading Iceberg delete vectors. That means we either go with the current proposal (a) that preserves the ability for existing Delta clients to read, or we go with a different proposal that we think is better, in which case Delta adds support.
>> >>
>> >> I think both options (c) and (d) have the same effect: Delta readers need to change and that breaks forward compatibility. Specifically:
>> >> * I think that option (c) would mean that we set the offset to either the magic bytes or directly to the start of the roaring bitmap, so I think we will almost certainly be able to read Delta DVs. Even if we didn't have a similar bitmap encoding, we would probably end up adding support for reading Delta DVs for iceberg-delta-lake. Then it's a question of whether support for converted files is required -- similar to how we handle missing partition values in data files from Hive tables, which we just updated the spec to clarify.
>> >> * Option (d) is still incompatible with existing Delta readers, so there isn't much of a difference between this and (b).
>> >>
>> >> To me, Micah's requirement #2 is a good goal, but it needs to be balanced against the cost. I don't see that cost as too high, and I think avoiding fragmentation across the projects helps us work together more in the future. But again, that may be my goal and not a priority for the broader Iceberg community.
>> >>
>> >> Ryan
>> >>
>> >> On Wed, Oct 16, 2024 at 10:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>
>> >>> One small point
>> >>>
>> >>>> Theoretically we could end up with iceberg implementers who have bugs in this part of the code and we wouldn't even know it was an issue till someone converted the table to delta.
>> >>>
>> >>> I guess we could mandate that readers validate all fields here to make sure they are all consistent, even if unused.
>> >>>
>> >>> Separately, I think it might pay to take a step back and restate the desired requirements of this design (in no particular order):
>> >>> 1. The best possible implementation of DVs (limited redundancy, no extraneous fields, CPU efficiency, minimal space, etc.).
>> >>> 2. The ability for Delta Lake readers to read Iceberg DVs.
>> >>> 3. The ability for Iceberg readers to read Delta Lake DVs.
>> >>>
>> >>> The current proposal accomplishes 2 and 3 at very low cost, with some cost for 1. I still think 1 is important.
>> >>> Table formats are still going through a very large growth phase, so taking suboptimal choices, when there are better choices that don't add substantial cost, shouldn't be done lightly. Granted, DVs are only going to be a very small part of the cost of any table format.
>> >>>
>> >>> I think it is worth discussing other options to see if we think there is a better one (if there isn't, then I would propose we continue with the current proposal). Please chime in if I missed one, but off the top of my head these are:
>> >>>
>> >>> a. Go forward with the current proposal.
>> >>> b. Create a different DV format that we feel is better, and take no additional steps for compatibility with Delta Lake.
>> >>> c. Create a different DV format that we feel is better, and allow backwards compatibility by adding "reader" support for Delta Lake DVs in the spec, but not "writer" support.
>> >>> d. Go forward with the current proposal but use offset and length to trim off the "offset" bytes. (I assume this would break Delta Lake readers, but I think Iceberg readers could still read Delta Lake tables. This option is very close to (c) but doesn't address all concerns around the DV format.)
>> >>>
>> >>> Out of these, my slight preference would be option c (add migration capabilities from Delta Lake to Iceberg), followed by option a (the current proposal).
>> >>>
>> >>> Cheers,
>> >>> Micah
>> >>>
>> >>> On Tue, Oct 15, 2024 at 9:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>
>> >>>> @Scott We would have the ability to read Delta vectors regardless of what we pick, since on the Iceberg side we really just need the bitmap and what offset it is located at within a file; everything else could be in the Iceberg metadata. We don't have any disagreement on this aspect, I think.
>> >>>>
>> >>>> The question is whether we would write additional Delta-specific metadata into the vector itself that an Iceberg implementation would not use, so that current Delta readers could read Iceberg delete vectors without a code change or rewriting the vectors. The underlying representation would still be the same between the two formats.
>> >>>>
>> >>>> The pros to doing this are that a reverse translation of Iceberg to Delta would be much simpler. Any implementers who already have Delta vector read code can probably mostly reuse it, although our metadata would let them skip to just reading the bitmap.
>> >>>>
>> >>>> The cons are that the metadata being written isn't used by Iceberg, so any real tests would require using a Delta reader; anything else would just be synthetic. Theoretically we could end up with Iceberg implementers who have bugs in this part of the code and we wouldn't even know it was an issue till someone converted the table to Delta. Other Iceberg readers would just be ignoring these bytes, so we are essentially adding a requirement and complexity (although not that much) to Iceberg writers for the benefit of current Delta readers. Delta would probably also have to add new fields to their metadata representations to capture the vector metadata to handle our bitmaps.
>> >>>>
>> >>>> On Tue, Oct 15, 2024 at 5:56 PM Scott Cowell <scott.cow...@dremio.com.invalid> wrote:
>> >>>>>
>> >>>>> From an engine perspective I think compatibility between Delta and Iceberg on DVs is a great thing to have.
>> >>>>> The additions for cross-compat seem a minor thing to me that is vastly outweighed by a future where Delta tables with DVs are supported in Delta Uniform and can be read by any Iceberg V3 compliant engine.
>> >>>>>
>> >>>>> -Scott
>> >>>>>
>> >>>>> On Tue, Oct 15, 2024 at 2:06 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Are there engines/vendors/companies in the community that support both Iceberg and Delta and would benefit from having one blob layout for DVs?
>> >>>>>>
>> >>>>>> - Anton
>> >>>>>>
>> >>>>>> On Tue, Oct 15, 2024 at 11:10, rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Thanks, Szehon.
>> >>>>>>>
>> >>>>>>> To clarify on compatibility, using the same format for the blobs makes it so that existing Delta readers can read and use the DVs written by Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra fields they would not need to change how readers work. That's why I think there is a benefit to using the same format: we avoid fragmentation and make sure data and delete files are compatible. No unnecessary fragmentation.
>> >>>>>>>
>> >>>>>>> Ryan
>> >>>>>>>
>> >>>>>>> On Tue, Oct 15, 2024 at 10:57 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> This is awesome work by Anton and Ryan; it looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient. It's been a long time coming and I'm truly excited to see the great improvement in storage/perf.
>> >>>>>>>>
>> >>>>>>>> With regard to these fields, I think most of the concerns are already mentioned by the other community members in the PRs https://github.com/apache/iceberg/pull/11238 and https://github.com/apache/iceberg/pull/11240, so not much to add. The DV itself is the RoaringBitmap 64-bit format, so that's great; the argument for CRC seems reasonable; and I don't have enough data to be opinionated about the endianness/magic bytes.
>> >>>>>>>>
>> >>>>>>>> But I do lean towards the many PR comments that the extra length field is unnecessary and would just add confusion. It seems to me that the Iceberg community has made so much effort to trim the spec to the bare minimum for cleanliness and efficiency, so I do feel the field is not in the normal direction of the project. Also, I'm not clear on the plan for old Delta readers: they can't read Puffin anyway, and if Delta adopts Puffin, then new readers could adopt? Anyway, great work again, thanks for raising the issue on the dev list!
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Szehon
>> >>>>>>>>
>> >>>>>>>> On Mon, Oct 14, 2024 at 5:14 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> > I think it might be worth mentioning the current proposal makes some, mostly minor, design choices to try to be compatible with Delta Lake deletion vectors.
>> >>>>>>>>>
>> >>>>>>>>> Yes it does, and thanks for pointing this out, Micah. I think it's important to consider whether compatibility is important to this community. I just replied to Piotr on the PR, but I'll adapt some of that response here to reach the broader community.
>> >>>>>>>>>
>> >>>>>>>>> I think there is value in supporting compatibility with older Delta readers, but I acknowledge that this may be my perspective because my employer has a lot of Delta customers that we are going to support now and in the future.
>> >>>>>>>>> The main use case for maintaining compatibility with the Delta format is that it's hard to move old jobs to new code in a migration. We see a similar issue in Hive to Iceberg migrations, where unknown older readers prevent migration entirely because they are hard to track down and often read files directly from the backing object store. I'd like to avoid the same problem here, where all readers need to be identified and migrated at the same time. Compatibility with the format those readers expect makes it possible to maintain Delta metadata for them temporarily. That increases confidence that things won't randomly break and makes it easier to get people to move forward.
>> >>>>>>>>>
>> >>>>>>>>> The second reason for maintaining compatibility is that we want the formats to become more similar. My hope is that we can work across both communities and come up with a common metadata format in a future version -- which explains my interest in smooth migrations. Maintaining compatibility in cases like this builds trust and keeps our options open: if we have compatible data layers, then it's easier to build a compatible metadata layer. I'm hoping that if we make the blob format compatible, we can get the Delta community to start using Puffin for better self-describing delete files.
>> >>>>>>>>>
>> >>>>>>>>> Other people may not share those goals, so I think it helps to consider what is being compromised for this compatibility. I don't think it is too much. There are 2 additional fields:
>> >>>>>>>>> * A 4-byte length field (that Iceberg doesn't need)
>> >>>>>>>>> * A 4-byte CRC to validate the contents of the bitmap
>> >>>>>>>>>
>> >>>>>>>>> There are also changes to how these would have been added if the Iceberg community were building this independently:
>> >>>>>>>>> * Our initial version didn't include a CRC at all, but now that we think it's useful, compatibility means using a CRC-32 checksum rather than a newer one
>> >>>>>>>>> * The Delta format uses big endian for its fields (or mixed endian if you consider that RoaringBitmap is LE)
>> >>>>>>>>> * The magic bytes (added to avoid reading the Puffin footer) would have been different
>> >>>>>>>>>
>> >>>>>>>>> Overall, I don't think that those changes to what we would have done are unreasonable. It's only 8 extra bytes, and half of them are for a checksum that is a good idea.
>> >>>>>>>>>
>> >>>>>>>>> I'm looking forward to what the rest of the community thinks about this. Thanks for reviewing the PR!
>> >>>>>>>>>
>> >>>>>>>>> Ryan
>> >>>>>>>>>
>> >>>>>>>>> On Sun, Oct 13, 2024 at 10:45 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for the PRs! I reviewed Anton's document; I will do a pass on the PRs.
>> >>>>>>>>>>
>> >>>>>>>>>> Imho, it's important to get feedback from query engines because, while delete vectors are not a problem per se (they're what we are using as the internal representation), the use of Puffin files to store them is "impactful" for the query engines (some query engines will probably need to implement the Puffin spec (read/write) in a language other than Java, for instance Apache Impala).
>> >>>>>>>>>> I like the proposal, I just hope we won't "surprise" some query engines with extra work :)
>> >>>>>>>>>>
>> >>>>>>>>>> Regards
>> >>>>>>>>>> JB
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Oct 10, 2024 at 11:41 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> > Hi everyone,
>> >>>>>>>>>> >
>> >>>>>>>>>> > There seems to be broad agreement around Anton's proposal to use deletion vectors in Iceberg v3, so I've opened two PRs that update the spec with the proposed changes. The first, PR #11238, adds a new Puffin blob type, delete-vector-v1, that stores a delete vector. The second, PR #11240, updates the Iceberg table spec.
>> >>>>>>>>>> >
>> >>>>>>>>>> > Please take a look and comment!
>> >>>>>>>>>> >
>> >>>>>>>>>> > Ryan
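To make the fields discussed in the thread concrete (the 4-byte length, magic bytes, portable 64-bit roaring bitmap, and CRC-32 that Ryan enumerates), here is a hedged read-side sketch in Java. It assumes the same layout as the writer sketch earlier, that Iceberg metadata supplies the blob offset (as Russell notes, the offset and bitmap are all Iceberg strictly needs), and it validates the checksum along the lines of Micah's suggestion that readers could be required to check all fields. The field coverage of the length and CRC is an assumption, not the authoritative spec.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class DeleteVectorBlobReadSketch {
  // puffinFileBytes holds the Puffin file contents; offset points at the blob and is
  // assumed to come from Iceberg metadata (which could also provide the length).
  static byte[] readBitmap(byte[] puffinFileBytes, int offset) {
    ByteBuffer buf = ByteBuffer.wrap(puffinFileBytes).order(ByteOrder.BIG_ENDIAN);
    buf.position(offset);

    int dataLength = buf.getInt();        // length of magic + bitmap; redundant for Iceberg, used by Delta readers
    byte[] magic = new byte[4];
    buf.get(magic);                       // magic bytes identify the blob without the Puffin footer
    byte[] bitmapBytes = new byte[dataLength - magic.length];
    buf.get(bitmapBytes);                 // portable 64-bit roaring bitmap (little-endian internally)
    long storedCrc = Integer.toUnsignedLong(buf.getInt());

    CRC32 crc = new CRC32();              // validate the trailing CRC-32 over magic + bitmap
    crc.update(magic);
    crc.update(bitmapBytes);
    if (crc.getValue() != storedCrc) {
      throw new IllegalStateException("CRC-32 mismatch for delete vector blob");
    }
    return bitmapBytes;
  }
}
```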