Hi iceberg-dev,

I have a table that is partitioned by id (custom partitioning at the
moment, not Iceberg hidden partitioning) and event time. Individual
DELETEs finish reasonably fast, for example:

sql("DELETE FROM table WHERE id_shard = 111 AND id = 111456")
sql("DELETE FROM table WHERE id_shard = 222 AND id = 222456")

I want to reduce the number of table commits by creating a temp view and
running a single DELETE for both rows, but that version is extremely slow:

df = spark.createDataFrame([(111, 111456), (222, 222456)],
                           ["id_shard_t", "id_t"])
df.createOrReplaceTempView("rows")
sql("DELETE FROM table WHERE (id_shard, id) IN (SELECT id_shard_t, id_t FROM rows)")

It looks like the multi-column IN subquery defeats partition pruning. Any
hints on reducing the number of table commits while keeping the queries
efficient? Thank you!
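For concreteness, one alternative I have been considering is building a
single DELETE whose predicate is an OR of literal conjunctions, so each
disjunct still carries a literal id_shard value that the planner can use
for partition pruning, while the whole thing is still one commit. A
minimal sketch (the helper name and the assumption that pruning survives
OR'ed literals are mine, not verified):

```python
def build_delete(table, pairs):
    """Build one DELETE statement from (id_shard, id) pairs.

    Hypothetical helper: each disjunct keeps a literal id_shard
    predicate, unlike the multi-column IN subquery above.
    """
    preds = " OR ".join(
        f"(id_shard = {shard} AND id = {id_})" for shard, id_ in pairs
    )
    return f"DELETE FROM {table} WHERE {preds}"

stmt = build_delete("table", [(111, 111456), (222, 222456)])
# spark.sql(stmt)  # one statement, one commit
```

I have not confirmed whether this actually prunes better than the
subquery form on a large batch, so I would welcome corrections.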

--
Huadong
