> > *Upserts/Deletes*
> >
> > I have jobs that apply upserts/deletes to datasets. My current approach is:
> >
> > 1. calculate the affected partitions (collected in the Driver)
> > 2. load up the previous versions of all of those partitions as a DataFrame
> > 3. apply the upserts/deletes
> > 4. write out the new versions of the affected partitions (containing old data plus/minus the upserts/deletes)
> > 5. update my index file
> >
> > How is this intended to be done in Iceberg? I see that there are a bunch of Table operations. Would it be up to me to still do steps 1-4 and then rely on Iceberg to do step 5 using the table operations?
>
> Currently, you can delete data by reading, filtering, and overwriting what you read. That's an atomic operation so it is safe to read and overwrite the same data.
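For context, here is roughly what steps 1-4 of my current (pre-Iceberg) flow look like. Everything in this sketch is made up for illustration: keys_to_delete and upserts are DataFrames of changed rows, and key_cols, base_path, out_path, and update_index are placeholders for my own key columns, paths, and index bookkeeping:

    from pyspark.sql import functions as F

    # 1. compute the affected partitions on the driver
    changed = keys_to_delete.select('partition_col').union(upserts.select('partition_col'))
    affected = [r['partition_col'] for r in changed.distinct().collect()]

    # 2. load the previous versions of just those partitions
    old = (spark.read.parquet(base_path)
                .filter(F.col('partition_col').isin(affected)))

    # 3. apply the upserts/deletes: drop the old version of every changed
    #    key, then append the new versions
    changed_keys = keys_to_delete.select(key_cols).union(upserts.select(key_cols))
    new = (old.join(changed_keys, on=key_cols, how='left_anti')
              .unionByName(upserts))

    # 4. write new versions of the affected partitions to a fresh location
    new.write.mode('overwrite').partitionBy('partition_col').parquet(out_path)

    # 5. point my index file at the new partition locations
    update_index(affected, out_path)  # my own bookkeeping, not shown here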
Does such an operation take advantage of partitioning to minimize write amplification? For example, let's say I do something like this:

    path = 'some_path'
    df = spark.read.format('iceberg').load(path)
    df = (
        df
        # drop every row in a partition that has deletes...
        .join(keys_to_delete, on=['partition_col'], how='anti')
        # ...then add the new versions back in
        .union(upserts)
    )
    df.write.format('iceberg').mode('overwrite').save(path)

Is this going to result in scanning and rewriting the entire dataset even if the keys to delete are in a small subset of partitions? I would imagine so. What is the correct way to do this?
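The only workaround I can imagine is pruning to the affected partitions myself, on both the read and the write, along the lines of the sketch below (assuming a Spark with the DataFrameWriterV2 API; the table name is made up):

    from pyspark.sql import functions as F

    tbl = 'db.events'  # hypothetical Iceberg table

    # partition values touched by the deletes/upserts, collected on the driver
    affected = [r['partition_col'] for r in
                keys_to_delete.select('partition_col')
                              .union(upserts.select('partition_col'))
                              .distinct().collect()]

    # filter on the partition column up front so the scan can prune to
    # the affected partitions instead of reading the whole table
    df = spark.table(tbl).filter(F.col('partition_col').isin(affected))

    new_df = (df.join(keys_to_delete, on=['partition_col'], how='anti')
                .union(upserts))

    # dynamic overwrite: replace only the partitions present in new_df,
    # leaving every other partition of the table untouched
    new_df.writeTo(tbl).overwritePartitions()

But I would rather not manage that pruning by hand if Iceberg can derive it from the filters in the job itself.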