Hey all, it's been a while, but I wanted to follow up on this thread in case
there are any remaining thoughts on changing the default write granularity
in Spark to file scoped.

Given the degenerate read amplification case for partition-scoped deletes and
the additional benefits outlined earlier, I think it would be great if we
could move forward with this.

I'd also like to point out that the latest version of my PR [1] changes the
default to file-scoped deletes for all Spark 3.5 writes via SparkWriteConf,
rather than just for new V2 tables. It's potentially a wider-reaching change,
but considering the benefits mentioned earlier in the thread it seems worth
it. If any particular job has issues with this default change, the conf can
be overridden.
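
For anyone who does hit a regression, here's a minimal sketch (the catalog
and table names are just placeholders) of opting a single table back into
partition-scoped deletes via the write.delete.granularity table property:

    // Scala, e.g. from spark-shell with an Iceberg catalog configured
    spark.sql(
      "ALTER TABLE my_catalog.db.events " +
        "SET TBLPROPERTIES ('write.delete.granularity' = 'partition')")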

Please let me know what you think!

[1] https://github.com/apache/iceberg/pull/11478

Thanks,

Amogh Jahagirdar

On Tue, Nov 26, 2024 at 3:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

>
>
> Just following up on this thread,
>
> Getting numbers for various table layouts is involved, so I think it would
> instead be helpful to look at a degenerate read amplification case arising
> from partition-scoped deletes.
> Furthermore, there are some additional benefits to file-scoped deletes,
> which benchmarking alone won't capture, that are worth calling out.
>
>
> *Degenerate Read Amplification Case for Partition Scoped Deletes*
> Consider a table with id and data columns, partitioned with bucket(10, id)
> and 10 data files per bucket. With partition-scoped deletes, if a delete
> affects only two files within a bucket, any read touching the 8 unaffected
> files in that partition will unnecessarily fetch the delete file. To
> generalize, say there are D data files per partition, W writes performed,
> and I impacted data files for deletes, where I << D:
>
>    - Partition-scoped: O(D * W) delete file reads in the worst case for a
>    given partition, because each write produces a partition-scoped delete
>    file which would need to be read for each data file.
>    - File-scoped: O(I) delete file reads in the worst case for a given
>    partition.
>
> The key point is that read amplification with partition-scoped deletes can
> grow with every write that's performed, and this is further compounded in
> aggregate by how many partitions are impacted.
> The number of file-scoped deletes that need to be read scales independently
> of the number of writes performed, since they're targeted per data file and
> each delete is maintained on write to ensure there's at most one delete file
> for a given data file.
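>
> To make the bounds above concrete, here's a toy sketch with made-up numbers
> (purely illustrative, not a benchmark):
>
>     // Scala: worst-case delete file reads when scanning a single partition
>     val D = 10   // data files in the partition
>     val W = 100  // writes performed, each producing position deletes
>     val I = 5    // impacted data files, with I << D
>     val partitionScoped = D * W // each data file may consult every partition-scoped delete file: 1000
>     val fileScoped      = I     // one maintained delete file per impacted data file: 5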
>
> *Additional benefits to file scoped deletes*:
>
>    - Now that file-scoped deletes are maintained as part of writes, old
>    deletes will be removed from storage. With partition-scoped deletes, a
>    delete file cannot be removed even when it's replaced for a given data
>    file, since it may still contain deletes for other data files.
>    - Moving towards file-scoped deletes will help avoid unnecessary
>    conflicts with concurrent compactions and other writes.
>    - Other engines already write file-scoped deletes. Anton mentioned Flink
>    earlier in the thread, which also ties into the previous point, since much
>    of the goal of writing file-scoped deletes there was to avoid unnecessary
>    conflicts with concurrent compactions. Trino
>    
> <https://github.com/trinodb/trino/blob/332a658f3a292f79da203a131d2e25b2418d5279/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/PositionDeleteWriter.java#L73>
>    is another engine which writes file-scoped position deletes as of today.
>    - We know that V3 deletion vectors will be an improvement in the
>    majority of cases compared to existing position deletes. Considering this,
>    and the fact that file-scoped deletes are closer to DVs as Anton mentioned,
>    moving to file-scoped deletes now makes the eventual migration to DVs
>    easier.
>
> It's worth clarifying that if there is some regression for a given
> workload, it's always possible to set the property back to partition
> scoped.
>
> Thanks,
> Amogh Jahagirdar
>
> On Fri, Nov 15, 2024 at 11:54 AM Amogh Jahagirdar <2am...@gmail.com>
> wrote:
>
>> Following up on this thread,
>>
>> > I don't think this is a bad idea from a theoretical perspective. Do we
>> have any actual numbers to back up the change?
>> There are no numbers yet; changing the default is largely driven by the
>> fact that the previous downside of file-scoped deletes, namely many more
>> files on disk, is now mitigated by Spark sync maintenance.
>>
>> To get some more numbers, I think we'll first need to update some of our
>> benchmarks to be more representative of typical MoR workloads. For example,
>> DeleteFileIndexBenchmark currently tests tables with a small number of
>> partitions, many data files, and very few delete files. That leads to a huge
>> difference in file counts between partition-scoped and file-scoped for that
>> benchmark and doesn't show the benefits of moving to file-scoped deletes by
>> default. I'll look into updating some of these benchmarks.
>>
>> Thanks,
>>
>> Amogh Jahagirdar
>>
>> On Tue, Nov 12, 2024 at 12:36 PM Anton Okolnychyi <aokolnyc...@gmail.com>
>> wrote:
>>
>>> I support the idea of switching to file-scoped deletes for new tables.
>>> The absence of sync maintenance prevented us from doing that earlier. Given
>>> that Amogh recently merged that functionality into main, we should be fine.
>>> Flink connectors have been using file-scoped deletes for a while now to
>>> solve issues with concurrent compactions.
>>>
>>> File-scoped deletes are closer to DVs and even share some code like
>>> assignment by path and sync maintenance in Spark. This migration will allow
>>> us to test many concepts that are needed for DVs prior to completing the V3
>>> spec. It is also easier to migrate file-scoped deletes to DVs in V3. Unlike
>>> partition-scoped deletes, they can be discarded from the table state once a
>>> DV is produced for that data file.
>>>
>>> - Anton
>>>
>>>
>>>
>>> On Mon, Nov 11, 2024 at 9:01 PM Russell Spitzer <russell.spit...@gmail.com>
>>> wrote:
>>>
>>>> I don't think this is a bad idea from a theoretical perspective. Do we
>>>> have any actual numbers to back up the change? I would think for most folks
>>>> we would recommend just going to V3 rather than changing granularity for
>>>> their new tables. It would just affect new tables though, so I'm not opposed
>>>> to the change except on the grounds that we are changing a default.
>>>>
>>>> On Mon, Nov 11, 2024 at 1:56 PM Amogh Jahagirdar <2am...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I wanted to discuss changing the default position delete file
>>>>> granularity for Spark from partition to file level for any newly created
>>>>> V2 tables. See this PR [1].
>>>>>
>>>>> Context on delete file granularity:
>>>>>
>>>>>    - Partition granularity: Writers group deletes for multiple data
>>>>>    files from the same partition into the same delete file. This leads to
>>>>>    fewer files on disk, but higher read amplification, since a scan also
>>>>>    reads delete information for data files it doesn't touch.
>>>>>    - File granularity: Writers write a new delete file for every changed
>>>>>    data file. Reads of delete information are more targeted, but this can
>>>>>    lead to more files on disk.
>>>>>
>>>>> With the recent merge of synchronous position delete maintenance on
>>>>> write in Spark [2], file granularity as a default is more compelling since
>>>>> reads would be more targeted *and* files would be maintained on disk.
>>>>> I also recommend folks go through the deletion vector design doc for more
>>>>> details [3].
>>>>>
>>>>> Note that for existing tables with high delete-to-data file ratios,
>>>>> Iceberg's rewrite position deletes procedure can compact the existing
>>>>> delete files, and every subsequent write would then continuously maintain
>>>>> the position deletes. Additionally, note that in V3 at most one Puffin
>>>>> position delete file is allowed per data file; what's being discussed here
>>>>> is changing the default granularity for new V2 tables, since it should
>>>>> generally be better after the sync maintenance addition.
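>>>>>
>>>>> As a rough sketch of that one-off compaction step (catalog and table
>>>>> names are placeholders), the Spark procedure call would look something
>>>>> like:
>>>>>
>>>>>     // Scala: compact existing position delete files for a table
>>>>>     spark.sql(
>>>>>       "CALL my_catalog.system.rewrite_position_delete_files(table => 'db.sample')")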
>>>>>
>>>>> What are folks' thoughts on this?
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/11478
>>>>> [2] https://github.com/apache/iceberg/pull/11273
>>>>> [3]
>>>>> https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM/edit?tab=t.0#heading=h.193fl7s89tcg
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>
