> we would instead add support for pushing down `CAST` expressions from Spark

Supporting push-down of more expressions is definitely worth exploring.
IIUC, we should already be able to do this kind of thing thanks to system
function push-down. Users can issue a query in Spark that deterministically
deletes only whole partitions, like
```
DELETE FROM table WHERE system.days(ts) = date('2024-08-03')
```
where `ts` is the source field of the `days` partition transform on the table.
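For concreteness, a minimal sketch of what that looks like end to end (table and column names are illustrative, assuming Spark with the Iceberg SQL extensions enabled):

```sql
-- Hypothetical table partitioned by a days transform on ts
CREATE TABLE prod.db.events (id bigint, ts timestamp)
USING iceberg
PARTITIONED BY (days(ts));

-- The filter aligns exactly with one partition, so this should be
-- a metadata-only delete (no pos-delete files written)
DELETE FROM prod.db.events WHERE system.days(ts) = date('2024-08-03');
```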


> Rather than adding a new SQL keyword (which is a lot of work and specific to 
> Iceberg)

That’s true. Maybe we can start with a session conf, or consult the Spark
community about adding the ability to enforce deletion via metadata operations
only. Even if we can cleanly delete whole partitions from a table, users might
still want to be absolutely sure that the delete SQL doesn’t accidentally
create pos-delete files or rewrite whole data files.
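To illustrate the idea, something along these lines (the conf name is purely hypothetical; no such property exists today):

```sql
-- Hypothetical session conf: when enabled, a DELETE that cannot be
-- executed as a metadata-only operation fails instead of writing
-- pos-delete files or rewriting data files
SET spark.sql.iceberg.delete.metadata-only = true;

-- This filter does not align with partition boundaries, so with the
-- conf on it would fail rather than silently produce delete files
DELETE FROM prod.db.events WHERE ts >= timestamp '2024-08-03 12:00:00';
```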

> On Aug 2, 2024, at 23:27, Ryan Blue <b...@databricks.com.INVALID> wrote:
> 
> There's a potential solution that's similar to what Xianjin suggested. Rather 
> than adding a new SQL keyword (which is a lot of work and specific to 
> Iceberg) we would instead add support for pushing down `CAST` expressions 
> from Spark. That way you could use filters like `DELETE FROM table WHERE 
> cast(ts AS date) = '2024-08-02'` and the alignment with partitions would 
> happen automatically. This would be generally useful for selecting all the 
> data for a date easily as well.
> 
> On Fri, Aug 2, 2024 at 6:32 AM Xianjin YE <xian...@apache.org 
> <mailto:xian...@apache.org>> wrote:
>> > b) they are concerned that if the WHERE filter of the DELETE is not 
>> > aligned with partition boundaries, they might end up with pos-deletes 
>> > that could impact their read performance
>> 
>> I think this is a legitimate concern, and currently `DELETE FROM` cannot 
>> guarantee that. It would be valuable to have a way to enforce it.
>> 
>> Like others have already pointed out, the concept of partitioning is not 
>> aligned between Iceberg and Hive; adding `DROP PARTITION` directly might 
>> confuse people if they have changed the table's partition spec.
>> 
>> How about adding new SQL syntax to express the semantics of deleting/dropping 
>> a partition only? Something like:
>> 
>> ```
>> DELETE PARTITION FROM table WHERE delete_filter
>> ```
>> The SQL would delete partitions only if it's a metadata-only operation.
>> 
>>> On Aug 2, 2024, at 20:34, Gabor Kaszab <gaborkas...@cloudera.com.INVALID> 
>>> wrote:
>>> 
>>> Hey Everyone,
>>> 
>>> Thanks for the responses and sorry for the long delay in mine. Let me try 
>>> to answer the questions that came up.
>>> 
>>> Yes, this has been an ask from a specific user who finds the lack of DROP 
>>> PARTITION a blocker for migrating from Hive tables to Iceberg. I know our 
>>> initial response, too, was to use DELETE FROM instead, but a) there are 
>>> users who have grown so big that it's nearly impossible to educate them, 
>>> and b) they are concerned that if the WHERE filter of the DELETE is not 
>>> aligned with partition boundaries, they might end up with pos-deletes 
>>> that could impact their read performance. So they find it crucial to have 
>>> a guarantee that when they try to drop data within a partition, it's 
>>> either a metadata-only operation or it fails.
>>> 
>>> About ADD PARTITION: I agree it wouldn't make sense for Iceberg, but 
>>> fortunately there is no user ask for it either. I think DROP PARTITION 
>>> would still make sense without ADD PARTITION, as the latter would be a 
>>> no-op in the Iceberg world.
>>> 
>>> I gave this some thought, and even though the concept of partitioning 
>>> doesn't align with a command like DROP PARTITION, I still see a rationale 
>>> for implementing it anyway. There are always going to be users coming from 
>>> the Hive-table world, it offers some safety nets, and - even though I have 
>>> no contributions in Spark or Iceberg-Spark - this seems like an isolated 
>>> feature with no risk of causing regressions in the existing code. Partition 
>>> evolution has to be given some extra thought with respect to DROP 
>>> PARTITION, as the Hive world didn't have that, but if we can reach 
>>> consensus on that, I feel this addition has real value.
>>> 
>>> Not sure I know what it means to have a use-case specific implementation 
>>> instead of having it in e.g. Iceberg-Spark.
>>> 
>>> Have a nice weekend!
>>> Gabor
>>> 
>>> On Mon, Jul 22, 2024 at 7:05 PM Jean-Baptiste Onofré <j...@nanthrax.net 
>>> <mailto:j...@nanthrax.net>> wrote:
>>>> Hi Walaa
>>>> 
>>>> It makes sense, thanks for pointing out the use case.
>>>> 
>>>> I agree that it's better to consider a use-case specific impl.
>>>> 
>>>> Regards
>>>> JB
>>>> 
>>>> On Wed, Jul 17, 2024 at 11:36 PM Walaa Eldin Moustafa
>>>> <wa.moust...@gmail.com <mailto:wa.moust...@gmail.com>> wrote:
>>>> >
>>>> > Hi Jean, One use case is Hive to Iceberg migration, where DROP PARTITION 
>>>> > statements would not need to be rewritten as DELETE queries prior to the 
>>>> > migration.
>>>> >
>>>> > That said, I am not in favor of adding this to Iceberg directly (or 
>>>> > Iceberg-Spark) due to the reasons Jean mentioned. It might be possible 
>>>> > to do it in a custom extension or custom connector outside Iceberg that 
>>>> > is specific for the use case (e.g., the migration use case I mentioned 
>>>> > above).
>>>> >
>>>> > Further, as Szehon said, it would not make sense without ADD PARTITION. 
>>>> > However, ADD PARTITION would require a spec change (since Iceberg does 
>>>> > not support empty partitions but ADD PARTITION does).
>>>> >
>>>> > So overall I am -1 to DROP PARTITION in Iceberg default implementation, 
>>>> > and I think it is better to consider implementing in a use case specific 
>>>> > implementation.
>>>> >
>>>> > Thanks,
>>>> > Walaa.
>>>> >
>>>> >
>>>> > On Wed, Jul 17, 2024 at 12:34 PM Jean-Baptiste Onofré <j...@nanthrax.net 
>>>> > <mailto:j...@nanthrax.net>> wrote:
>>>> >>
>>>> >> Hi Gabor
>>>> >>
>>>> >> Do you have user requests for that? Iceberg produces partitions by
>>>> >> taking column values (optionally with a transform function), so
>>>> >> hidden partitioning doesn't require user action. I wonder about the
>>>> >> use cases for dynamic partitioning (using ADD/DROP). Is it more for
>>>> >> partition maintenance?
>>>> >>
>>>> >> Thanks !
>>>> >> Regards
>>>> >> JB
>>>> >>
>>>> >> On Wed, Jul 17, 2024 at 11:11 AM Gabor Kaszab <gaborkas...@apache.org 
>>>> >> <mailto:gaborkas...@apache.org>> wrote:
>>>> >> >
>>>> >> > Hey Community,
>>>> >> >
>>>> >> > I learned recently that Spark doesn't support DROP PARTITION for 
>>>> >> > Iceberg tables. I understand this is because DROP PARTITION is 
>>>> >> > something used for Hive tables, and Iceberg's model of hidden 
>>>> >> > partitioning makes commands like this unnatural.
>>>> >> >
>>>> >> > However, I think that DROP PARTITION would still have some value for 
>>>> >> > users. In fact, in Impala we implemented this even for Iceberg tables. 
>>>> >> > Benefits could be:
>>>> >> >  - Users with workloads on Hive tables could keep using those 
>>>> >> > workloads after migrating their tables to Iceberg.
>>>> >> >  - As opposed to DELETE FROM, DROP PARTITION comes with a guarantee 
>>>> >> > that it is a metadata-only operation and no delete files are 
>>>> >> > written.
>>>> >> >
>>>> >> > I'm curious what the community thinks of this.
>>>> >> > Gabor
>>>> >> >
>> 
> 
> 
> --
> Ryan Blue
> Databricks
