Re: [DISCUSS] DROP PARTITION in Spark

Xianjin YE Fri, 02 Aug 2024 06:32:00 -0700

> b) they have a concern that with getting the WHERE filter of the DELETE not 
> aligned with partition boundaries they might end up having pos-deletes that 
> could have an impact on their read perf


I think this is a legit concern and currently `DELETE FROM` cannot guarantee 
that. It would be valuable to have a way to enforce that. 

As others have already pointed out, the concept of partitioning differs between 
Iceberg and Hive, so adding `DROP PARTITION` directly might confuse people if they 
have changed the partition spec of the table.

How about adding new SQL syntax that expresses the semantics of deleting/dropping 
whole partitions only? Something like:

```
DELETE PARTITION FROM table WHERE delete_filter
```
The statement would delete partitions only if it can run as a metadata-only operation.
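
To make the intended semantics concrete, here is a minimal Python sketch, not Iceberg or Spark code. It assumes a table partitioned by day and a delete filter given as a half-open time range; `DayPartition` and `partitions_to_drop` are hypothetical names used only for illustration. The check either returns the partitions fully covered by the filter or fails when the filter cuts through one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DayPartition:
    """Hypothetical stand-in for one day(ts) partition; not an Iceberg class."""
    day: datetime  # midnight at the start of the partition

    @property
    def end(self) -> datetime:
        return self.day + timedelta(days=1)


def partitions_to_drop(partitions, start, end):
    """Return the partitions fully covered by the half-open range [start, end).

    Raises ValueError if the range cuts through any partition, because deleting
    only part of a partition could not be done by dropping whole partitions.
    """
    covered = []
    for p in partitions:
        fully_inside = start <= p.day and p.end <= end
        fully_outside = p.end <= start or end <= p.day
        if fully_inside:
            covered.append(p)
        elif not fully_outside:
            raise ValueError(f"filter splits partition {p.day.date()}")
    return covered


parts = [DayPartition(datetime(2024, 7, d)) for d in (1, 2, 3)]

# Aligned filter: covers July 1 and July 2 entirely, leaves July 3 untouched.
aligned = partitions_to_drop(parts, datetime(2024, 7, 1), datetime(2024, 7, 3))

# Misaligned filter: starts at noon on July 1, so the call is rejected.
try:
    partitions_to_drop(parts, datetime(2024, 7, 1, 12), datetime(2024, 7, 3))
except ValueError as err:
    rejected = str(err)
```

With partition evolution, a real check would have to evaluate the filter against every partition spec present in the table, which this sketch does not attempt.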

> On Aug 2, 2024, at 20:34, Gabor Kaszab <[email protected]> 
> wrote:
> 
> Hey Everyone,
> 
> Thanks for the responses and sorry for the long delay in mine. Let me try to 
> answer the questions that came up.
> 
> Yes, this has been an ask from a specific user who finds the lack of DROP 
> PARTITION as a blocker for migrating to Iceberg from Hive tables. I know, our 
> initial response was also to use DELETE FROM instead, but a) there are users 
> who have grown so big that it's nearly impossible to educate them and b) they have a 
> concern that with getting the WHERE filter of the DELETE not aligned with 
> partition boundaries they might end up having pos-deletes that could have an 
> impact on their read perf. So they find it very crucial to have a guarantee 
> that when they try to drop data within a partition it's either a metadata 
> only operation or it fails.
> 
> About ADD PARTITION: I agree it wouldn't make sense for Iceberg, but 
> fortunately there is no user ask for it either. I think DROP PARTITION would 
> still make sense without ADD PARTITION as the latter one would be a no-op in 
> the Iceberg world.
> 
> I gave this some thought and even though the concept of partitioning is not 
> aligned with a command like DROP PARTITION, I still see rationale to 
> implement it anyway. There are always going to be users coming from the 
> Hive-table world, it has some safety nets, and - even though I have no 
> contributions in Spark or Iceberg-Spark - this seems an isolated feature that 
> has no risk of causing regressions in the existing code. Partition evolution 
> is something that has to be given some extra thought wrt DROP PARTITION as 
> the Hive-world didn't have that, but in case we can have a consensus on that 
> I feel that this addition has added value.
> 
> Not sure I know what it means to have a use-case specific implementation 
> instead of having it in e.g. Iceberg-Spark.
> 
> Have a nice weekend!
> Gabor
> 
> On Mon, Jul 22, 2024 at 7:05 PM Jean-Baptiste Onofré <[email protected]> wrote:
>> Hi Walaa
>> 
>> It makes sense, thanks for pointing out the use case.
>> 
>> I agree that it's better to consider a use-case specific impl.
>> 
>> Regards
>> JB
>> 
>> On Wed, Jul 17, 2024 at 11:36 PM Walaa Eldin Moustafa <[email protected]> wrote:
>> >
>> > Hi Jean, One use case is Hive to Iceberg migration, where DROP PARTITION 
>> > does not need to change to DELETE queries prior to the migration.
>> >
>> > That said, I am not in favor of adding this to Iceberg directly (or 
>> > Iceberg-Spark) due to the reasons Jean mentioned. It might be possible to 
>> > do it in a custom extension or custom connector outside Iceberg that is 
>> > specific for the use case (e.g., the migration use case I mentioned above).
>> >
>> > Further, as Szehon said, it would not make sense without ADD PARTITION. 
>> > However, ADD PARTITION requires a spec change (since Iceberg does not 
>> > support empty partitions but ADD PARTITION does).
>> >
>> > So overall I am -1 to DROP PARTITION in the Iceberg default implementation, 
>> > and I think it is better to consider implementing it in a use-case-specific 
>> > implementation.
>> >
>> > Thanks,
>> > Walaa.
>> >
>> >
>> > On Wed, Jul 17, 2024 at 12:34 PM Jean-Baptiste Onofré <[email protected]> wrote:
>> >>
>> >> Hi Gabor
>> >>
>> >> Do you have user requests for that? Iceberg produces partitions by
>> >> taking column values (optionally with a transform function), so
>> >> hidden partitioning doesn't require user actions. I wonder about the use
>> >> cases for dynamic partitioning (using ADD/DROP). Is it more for
>> >> partition maintenance?
>> >>
>> >> Thanks !
>> >> Regards
>> >> JB
>> >>
>> >> On Wed, Jul 17, 2024 at 11:11 AM Gabor Kaszab <[email protected]> wrote:
>> >> >
>> >> > Hey Community,
>> >> >
>> >> > I learned recently that Spark doesn't support DROP PARTITION for 
>> >> > Iceberg tables. I understand this is because the DROP PARTITION is 
>> >> > something being used for Hive tables and Iceberg's model for hidden 
>> >> > partitioning makes it unnatural to have commands like this.
>> >> >
>> >> > However, I think that DROP PARTITION would still have some value for 
>> >> > users. In fact in Impala we implemented this even for Iceberg tables. 
>> >> > Benefits could be:
>> >> >  - Users with workloads on Hive tables could keep using those workloads 
>> >> > after they migrate their tables to Iceberg.
>> >> >  - As opposed to DELETE FROM, DROP PARTITION has a guarantee that this is 
>> >> > going to be a metadata-only operation and no delete files are going to 
>> >> > be written.
>> >> >
>> >> > I'm curious what the community thinks of this.
>> >> > Gabor
>> >> >
