>
> *Upserts/Deletes*
>>
>> I have jobs that apply upserts/deletes to datasets. My current approach
>> is:
>>
>>    1. calculate the affected partitions (collected in the Driver)
>>    2. load up the previous versions of all of those partitions as a
>>    DataFrame
>>    3. apply the upserts/deletes
>>    4. write out the new versions of the affected partitions (containing
>>    old data plus/minus the upserts/deletes)
>>    5. update my index file
>>
>> How is this intended to be done in Iceberg? I see that there are a bunch
>> of Table operations. Would it be up to me to still do steps 1-4 and then
>> rely on Iceberg to do step 5 using the table operations?
>>
>
> Currently, you can delete data by reading, filtering, and overwriting what
> you read. That's an atomic operation, so it is safe to read and overwrite
> the same data.
>
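
If I follow, that read-filter-overwrite pattern would look roughly like
this (a PySpark sketch of my own; the `id` column and the literal keys
are placeholders, not anything prescribed by Iceberg):

from pyspark.sql import functions as F

path = 'some_path'
df = spark.read.format('iceberg').load(path)
# keep everything except the rows being deleted
kept = df.filter(~F.col('id').isin([1, 2, 3]))
# atomically replace the table contents with the filtered result
kept.write.format('iceberg').mode('overwrite').save(path)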

Does such an operation take advantage of partitioning to minimize write
amplification? For example, let's say I do something like this:

# assumes an active `spark` session, as above
path = 'some_path'
df = spark.read.format('iceberg').load(path)
df = (
    df
    # drop the rows matching the keys to delete/replace
    .join(keys_to_delete, on=['partition_col'], how='left_anti')
    # add back the new versions of the upserted rows
    .union(upserts)
)
df.write.format('iceberg').mode('overwrite').save(path)

Is this going to result in scanning and rewriting the entire dataset even
if the keys to delete are in a small subset of partitions? I would imagine
so, since nothing in that plan restricts the scan or the overwrite to the
affected partitions. What is the correct way to do this?
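
The only workaround I can come up with (a sketch of my own, not anything
I've verified against Iceberg's write path) is to prune the read to the
affected partitions myself, along the lines of step 1 of my current flow:

from pyspark.sql import functions as F

# step 1: collect the affected partition values on the driver
affected = [r[0] for r in
            keys_to_delete.select('partition_col').distinct().collect()]

# steps 2-4: load only those partitions, apply the upserts/deletes
subset = (spark.read.format('iceberg').load(path)
          .filter(F.col('partition_col').isin(affected)))
new_subset = (subset
              .join(keys_to_delete, on=['partition_col'], how='left_anti')
              .union(upserts))

# Would Iceberg honor Spark's dynamic partition overwrite here, so that
# only the affected partitions are replaced? Or would mode('overwrite')
# replace the whole table with just this subset?
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
new_subset.write.format('iceberg').mode('overwrite').save(path)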
