I think I understood the Rewrite strategy discussion a little differently.

BinPackStrategy and SortStrategy each get a new flag which lets you pick files based on their number of delete files. So basically you can set a variety of parameters: small files, large files, files with deletes, etc.
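To make that first option concrete, here is a rough sketch of what such a selection flag could look like. This is not Iceberg's actual API: `FileInfo`, `select_files_to_rewrite`, and the `min_delete_files` parameter are all made-up names for illustration, and the real strategies may weigh these criteria differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for a data file and the delete files applied to it."""
    path: str
    size_bytes: int
    delete_file_count: int


def select_files_to_rewrite(
    files: List[FileInfo],
    min_size_bytes: int,
    max_size_bytes: int,
    min_delete_files: Optional[int] = None,
) -> List[FileInfo]:
    """Pick files that are too small, too large, or have enough delete files.

    min_delete_files is the assumed new flag; None disables the delete check.
    """
    selected = []
    for f in files:
        wrong_size = f.size_bytes < min_size_bytes or f.size_bytes > max_size_bytes
        enough_deletes = (
            min_delete_files is not None
            and f.delete_file_count >= min_delete_files
        )
        if wrong_size or enough_deletes:
            selected.append(f)
    return selected


MB = 1024 * 1024
files = [
    FileInfo("a.parquet", 10 * MB, 0),    # too small
    FileInfo("b.parquet", 400 * MB, 0),   # healthy, no deletes
    FileInfo("c.parquet", 400 * MB, 3),   # healthy size, has deletes
]

# Size-only selection versus selection that also considers delete files.
by_size = select_files_to_rewrite(files, 100 * MB, 900 * MB)
with_deletes = select_files_to_rewrite(files, 100 * MB, 900 * MB, min_delete_files=1)
print([f.path for f in by_size])
print([f.path for f in with_deletes])
```

With the threshold set to 1, every file that has at least one delete file applied to it is picked up, regardless of its size.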
A new strategy is added which determines which files to rewrite by looking for all files currently touched by delete files. Instead of looking through files with X deletes, we look up all files affected by deletes and rewrite them.

Although now as I write this, it's basically running the above strategies with number of delete files >= 1 and files per group at 1. So maybe it doesn't need another strategy?

But maybe I got that wrong ...

On Thu, Oct 21, 2021 at 8:39 PM Jack Ye <yezhao...@gmail.com> wrote:

> Thanks to everyone who came to the meeting.
>
> Here is the full meeting recording I made:
> https://drive.google.com/file/d/1yuBFlNn9nkMlH9TIut2H8CXmJGLd18Sa/view?usp=sharing
>
> Here are some key takeaways:
>
> 1. we generally agreed upon the division of compactions into Rewrite,
> Convert and Merge.
>
> 2. Merge will be implemented through RewriteDataFiles as proposed in
> https://github.com/apache/iceberg/pull/3207, but instead as a new
> strategy by extending the existing BinPackStrategy. For users who would
> also like to run sort during Merge, we will have another delete strategy
> that extends the SortStrategy.
>
> 3. Merge can have an option that allows users to set the minimum numbers
> of delete files to trigger a compaction. However, that would result in very
> frequent compaction of full partition if people add many global delete
> files. A Convert of global equality deletes to partition position deletes
> while maintaining the same sequence number can be used to solve the issue.
> Currently there is no way to write files with a custom sequence number.
> This functionality needs to be added.
>
> 4. we generally agreed upon the APIs for Rewrite and Convert at
> https://github.com/apache/iceberg/pull/2841.
>
> 5. we had some discussion around the separation of row and partition level
> filters. The general direction in the meeting is to just have a single
> filter method. We will sync offline to reach an agreement.
>
> 6. people raised the issue that if new delete files are added to a data
> file while a Merge is going on, then the Merge would fail. That causes huge
> performance issues for CDC streaming use cases and Merge is very hard to
> succeed. There are 2 proposed solutions:
>     (1) for hot partitions, users can try to only perform Convert and
> Rewrite to keep delete file sizes and count manageable, until the partition
> becomes cold and a Merge can be performed safely.
>     (2) it looks like we need a Merge strategy that does not do any
> bin-packing, and only merges the delete files for each data file and writes
> it back. The new data file will have the same sequence number as the old
> file before Merge. By doing so, new delete files can still be applied
> safely and the compaction can succeed without concerns around conflict. The
> caveat is that this does not work for position deletes because the row
> position changes for each file after Merge. But for the CDC streaming use
> case it is acceptable to only write equality deletes, so this looks like a
> feasible approach.
>
> 7. people raised the concern about the memory consumption issue for the
> is_deleted metadata column. We ran out of time and will continue the
> discussion offline on Slack.
>
> Best,
> Jack Ye
>
> On Mon, Oct 18, 2021 at 7:50 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We are planning to have a meeting to discuss the design of Iceberg delete
>> compaction on Thursday 5-6pm PDT. The meeting link is
>> https://meet.google.com/nxx-nnvj-omx.
>>
>> We have also created the channel #compaction on Slack, please join the
>> channel for daily discussions if you are interested in the progress.
>>
>> Best,
>> Jack Ye
>>
>> On Tue, Sep 28, 2021 at 10:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> As there are more and more people adopting the v2 spec, we are seeing an
>>> increasing number of requests for delete compaction support.
>>>
>>> Here is a document discussing the use cases and basic interface design
>>> for it to get the community aligned around what compactions we would offer
>>> and how the interfaces would be divided:
>>> https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg
>>>
>>> Any feedback would be appreciated!
>>>
>>> Best,
>>> Jack Ye
>>>
>>