Re: Spec changes for deletion vectors

2024-10-22 Thread rdb...@gmail.com
Thanks for the careful consideration here, everyone. I completely agree that we do not want this to be confused as setting a precedent about delegating design decisions. I don't think the Delta community would do that either! We have to make decisions that are right for Iceberg. It sounds like we

Re: Spec changes for deletion vectors

2024-10-21 Thread Szehon Ho
I'm +1 for adding DVs (Goal 1) and also +1 for the ability for Iceberg readers to read Delta Lake DVs, as the magic bytes and CRC make sense design-wise (Goal 2). It's nice that there's cross-community collaboration, probably in other areas I'm not looking at, but I'm -0.5 on writing an otherwise unjust

Re: Spec changes for deletion vectors

2024-10-21 Thread Micah Kornfield
I agree with everything Russell said. I think we should move forward with the current format of DVs to favor compatibility. I'll add that I think the collaboration aspect likely applies to other aspects as well outside of Deletion Vectors (e.g. the work that is happening on Variant type). Thanks

Re: Spec changes for deletion vectors

2024-10-21 Thread Russell Spitzer
I've thought about this a lot and talked it over with a lot of folks. As I've noted before, my main concerns are: A. Setting a precedent that we are delegating design decisions to another project. B. Setting unnecessary requirements that can only really be checked by integration tests with another sy

Re: Spec changes for deletion vectors

2024-10-19 Thread rdb...@gmail.com
Thanks for the summary, Szehon! I would add one thing to the "minimum" for each option. Because we want to be able to seek directly to the DV for a particular data file, I think it's important to start the blob with magic bytes. That way the reader can validate that the offset was correct and that
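
The magic-byte check described in this message could look roughly like the sketch below. It is an illustration only: the marker value `MAGIC`, the helper name `read_blob_at`, and the container layout are hypothetical and are not taken from the Iceberg or Delta specs.

```python
import io

# Hypothetical 4-byte marker; a placeholder, not a value from any spec.
MAGIC = b"DVEX"


def read_blob_at(f, offset, length):
    """Seek to `offset`, read `length` bytes, and validate the leading magic bytes.

    Returns the bytes that follow the marker. Raises ValueError when the
    marker is missing, which usually means the recorded offset is wrong.
    """
    f.seek(offset)
    blob = f.read(length)
    if len(blob) != length:
        raise ValueError("blob is truncated")
    if blob[:len(MAGIC)] != MAGIC:
        raise ValueError(f"no magic bytes at offset {offset}")
    return blob[len(MAGIC):]


# Usage: an 8-byte header, then one 16-byte blob, then a footer.
container = io.BytesIO(b"header--" + MAGIC + b"bitmap-bytes" + b"footer")
payload = read_blob_at(container, 8, 16)
```

With the correct offset the helper returns the bytes after the marker. An offset that is off by even one byte fails fast instead of handing garbage to the bitmap decoder.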

Re: Spec changes for deletion vectors

2024-10-17 Thread Szehon Ho
So based on Micah's original goals, switch 2 and 3: 1. The best possible implementation of DVs (limited redundancy, no extraneous fields, CPU efficiency, minimal space, etc). 2. The ability for Iceberg readers to read Delta Lake DVs 3. The ability for Delta Lake readers to read Iceberg DVs The

Re: Spec changes for deletion vectors

2024-10-17 Thread Anton Okolnychyi
> > For the conversion from Delta to Iceberg, wouldn't we need to scan all of the Delta Vectors if we choose a different CRC or other endian-ness?
Exactly, we would not be able to expose Delta as Iceberg if we choose a different checksum type or byte order. Does Delta mandate that writers also

Re: Spec changes for deletion vectors

2024-10-17 Thread Russell Spitzer
For the conversion from Delta to Iceberg, wouldn't we need to scan all of the Delta Vectors if we choose a different CRC or other endian-ness? Does Delta mandate that writers also include this information in their metadata files? On Thu, Oct 17, 2024 at 4:26 PM Anton Okolnychyi wrote: > We would

Re: Spec changes for deletion vectors

2024-10-17 Thread Anton Okolnychyi
We would want to have magic bytes + checksum as part of the blob in Iceberg, as discussed in the spec PRs. If we chose something other than CRC and/or used little endian for all parts of the blob, this would break compatibility in either direction and would prevent the use case that Scott was me
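
To make the byte-order concern concrete, here is a minimal sketch of a blob framed as marker + payload + 4-byte checksum trailer. Everything in it is an assumption for illustration: the marker `MAGIC`, the function names, the framing, and the use of CRC-32 via `zlib.crc32` (the message only says "CRC").

```python
import struct
import zlib

# Hypothetical 4-byte marker; a placeholder, not a value from any spec.
MAGIC = b"DVEX"


def write_blob(payload, byte_order=">"):
    """Frame `payload` as MAGIC + payload + CRC-32 trailer.

    `byte_order` is a struct prefix: ">" for big endian, "<" for little endian.
    """
    crc = zlib.crc32(MAGIC + payload)
    return MAGIC + payload + struct.pack(byte_order + "I", crc)


def read_blob(blob):
    """Reader that expects the CRC-32 trailer to be big endian."""
    body, trailer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack(">I", trailer)
    if body[:len(MAGIC)] != MAGIC or zlib.crc32(body) != expected:
        raise ValueError("magic bytes or checksum mismatch")
    return body[len(MAGIC):]


payload = b"bitmap-bytes"
big = write_blob(payload, ">")
little = write_blob(payload, "<")
roundtrip = read_blob(big)
```

The two writers produce the same checksum value but store its bytes in opposite order, so a reader fixed on one byte order will in general reject blobs written with the other. That is the incompatibility this message describes.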

Re: Spec changes for deletion vectors

2024-10-17 Thread Bart Samwel
I hope it's OK if I chime in. I'm one of the people responsible for the format for position deletes that is used in Delta Lake and I've been reading along with the discussion. Given that the main sticking point is whether this compatibility is worth the associated "not pure" spec, I figured that ma

Re: Spec changes for deletion vectors

2024-10-17 Thread Jean-Baptiste Onofré
Hi folks, As Daniel said, I think we actually have two proposals in one: 1. The first proposal is "improvement of positional delete files", using delete vectors stored in Puffin files. I like this proposal, it makes a lot of sense. I think with a kind of consensus here (we discussed how to p

Re: Spec changes for deletion vectors

2024-10-16 Thread Daniel Weeks
Hey Everyone, I feel like at this point we've articulated all of the various options and paths forward, but this really just comes down to a matter of whether we want to make a concession here for the purpose of compatibility. If we were building this with no prior art, I would expect to omit the

Re: Spec changes for deletion vectors

2024-10-16 Thread rdb...@gmail.com
Thanks, Russell, for the clear summary of the pros and cons! I agree there's some risk to Iceberg implementations, but I think that is mitigated somewhat by code reuse. For example, an engine like Trino could simply reuse code for reading Delta bitmaps, so we would get some validation and support mo

Re: Spec changes for deletion vectors

2024-10-16 Thread Micah Kornfield
One small point
> Theoretically we could end up with iceberg implementers who have bugs in this part of the code and we wouldn’t even know it was an issue till someone converted the table to delta.
I guess we could mandate readers validate all fields here to make sure they are all consistent

Re: Spec changes for deletion vectors

2024-10-15 Thread Russell Spitzer
@Scott We would have the ability to read Delta vectors regardless of what we pick, since on the Iceberg side we really just need the bitmap and what offset it is located at within a file; everything else could be in the Iceberg metadata. We don’t have any disagreement on this aspect, I think. The quest

Re: Spec changes for deletion vectors

2024-10-15 Thread Scott Cowell
From an engine perspective I think compatibility between Delta and Iceberg on DVs is a great thing to have. The additions for cross-compat seem a minor thing to me that is vastly outweighed by a future where Delta tables with DVs were supported in Delta Uniform and could be read by any Iceberg V3

Re: Spec changes for deletion vectors

2024-10-15 Thread Anton Okolnychyi
Are there engines/vendors/companies in the community that support both Iceberg and Delta and would benefit from having one blob layout for DVs? - Anton
On Tue, Oct 15, 2024 at 11:10, rdb...@gmail.com wrote:
> Thanks, Szehon.
>
> To clarify on compatibility, using the same format for the blobs make

Re: Spec changes for deletion vectors

2024-10-15 Thread rdb...@gmail.com
Thanks, Szehon. To clarify on compatibility, using the same format for the blobs makes it so that existing Delta readers can read and use the DVs written by Iceberg. I'd love for Delta to adopt Puffin, but if we adopt the extra fields they would not need to change how readers work. That's why I th

Re: Spec changes for deletion vectors

2024-10-15 Thread Szehon Ho
This is awesome work by Anton and Ryan. It looks like a ton of effort has gone into the V3 position vector proposal to make it clean and efficient. It has been a long time coming, and I'm truly excited to see the great improvement in storage/perf. With regard to these fields, I think most of the concerns are already me

Re: Spec changes for deletion vectors

2024-10-14 Thread rdb...@gmail.com
> I think it might be worth mentioning the current proposal makes some, mostly minor, design choices to try to be compatible with Delta Lake deletion vectors.
Yes it does, and thanks for pointing this out, Micah. I think it's important to consider whether compatibility is important to this communi

Re: Spec changes for deletion vectors

2024-10-13 Thread Jean-Baptiste Onofré
Hi, thanks for the PRs! I reviewed Anton's document, and I will do a pass on the PRs. IMHO, it's important to get feedback from query engines because, while delete vectors are not a problem per se (they are what we use as the internal representation), the use of Puffin files to store them is "impactful" for the

Re: Spec changes for deletion vectors

2024-10-13 Thread Anton Okolnychyi
Thanks for putting the spec PRs together, Ryan! A bit of context below. The concept of DVs is not external to Iceberg. We have been using Roaring bitmaps (aka DVs) as an in-memory representation for position deletes, which allowed us to support vectorized reads and buffer out-of-order positions i

Re: Spec changes for deletion vectors

2024-10-12 Thread Yufei Gu
I’d like to offer a perspective on compatibility. If the design is robust and reasonable, it is certainly welcomed. However, if the design falls short, it becomes a compromise—not just for Iceberg users, but for the entire ecosystem. I look forward to hearing your thoughts on this. Yufei On Fr

Re: Spec changes for deletion vectors

2024-10-11 Thread Manu Zhang
Hi Ryan, Do you mean the doc Improve Position Deletes in V3 by Anton? I don't recall Anton using the term "deletion vector" in his proposal. On Sat, Oct 12, 2024 at 12:30 AM Micah Kornfield wrote: > I

Re: Spec changes for deletion vectors

2024-10-11 Thread Micah Kornfield
I think it might be worth mentioning the current proposal makes some, mostly minor, design choices to try to be compatible with Delta Lake deletion vectors. I think there might be a general philosophical question on what compromises the community is willing to make for compatibility reasons. On T