There's a potential solution that's similar to what Xianjin suggested. Rather than adding a new SQL keyword (which is a lot of work and specific to Iceberg), we would instead add support for pushing down `CAST` expressions from Spark. That way you could use filters like `DELETE FROM table WHERE cast(ts AS date) = '2024-08-02'`, and the alignment with partitions would happen automatically. This would also be generally useful for easily selecting all the data for a given date.
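To make the alignment idea concrete, here is a minimal sketch in plain Python (not Iceberg or Spark APIs; the `days_transform` helper and the sample rows are illustrative assumptions). It shows why a predicate of the form `cast(ts AS date) = <day>` lines up exactly with a `days`-partitioned table, so the matching delete can drop whole data files instead of writing positional deletes:

```python
from datetime import datetime, date

def days_transform(ts: datetime) -> date:
    """Iceberg-style 'days' partition transform: maps a timestamp to its date.

    (Illustrative stand-in, not the Iceberg implementation.)
    """
    return ts.date()

# Hypothetical sample rows in a table partitioned by days(ts).
rows = [
    datetime(2024, 8, 2, 6, 32),
    datetime(2024, 8, 2, 20, 34),
    datetime(2024, 8, 3, 9, 0),
]

# Group rows into partitions the way a days-partitioned table would.
partitions: dict[date, list[datetime]] = {}
for ts in rows:
    partitions.setdefault(days_transform(ts), []).append(ts)

# The filter cast(ts AS date) = '2024-08-02' selects exactly the rows of the
# 2024-08-02 partition -- no partition is split by the predicate, which is
# what makes a metadata-only delete possible.
matched = [ts for ts in rows if days_transform(ts) == date(2024, 8, 2)]
assert matched == partitions[date(2024, 8, 2)]
```

A filter like `ts >= '2024-08-02 06:00:00'`, by contrast, would cut through the middle of a partition, forcing delete files to be written.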
On Fri, Aug 2, 2024 at 6:32 AM Xianjin YE <xian...@apache.org> wrote:
>
> b) they have a concern that with getting the WHERE filter of the DELETE not aligned with partition boundaries they might end up having pos-deletes that could have an impact on their read perf
>
> I think this is a legit concern, and currently `DELETE FROM` cannot guarantee that. It would be valuable to have a way to enforce it.
>
> Like others already pointed out, the concept of partitioning is not aligned between Iceberg and Hive; adding `DROP PARTITION` directly might confuse people if they have changed the partition spec of the table.
>
> How about we add a new SQL syntax to express the semantics of delete/drop partition only? Something like:
>
> ```
> DELETE PARTITION FROM table WHERE delete_filter
> ```
>
> The SQL will only delete partitions if it's a metadata-only operation.
>
> On Aug 2, 2024, at 20:34, Gabor Kaszab <gaborkas...@cloudera.com.INVALID> wrote:
>
> Hey Everyone,
>
> Thanks for the responses, and sorry for the long delay in mine. Let me try to answer the questions that came up.
>
> Yes, this has been an ask from a specific user who finds the lack of DROP PARTITION a blocker for migrating to Iceberg from Hive tables. I know, our initial response was also to use DELETE FROM instead, but a) there are users who have grown so big that it's nearly impossible to educate them, and b) they have a concern that with the WHERE filter of the DELETE not aligned with partition boundaries they might end up having pos-deletes that could have an impact on their read perf. So they find it very crucial to have a guarantee that when they try to drop data within a partition, it's either a metadata-only operation or it fails.
>
> About ADD PARTITION: I agree it wouldn't make sense for Iceberg, but fortunately there is no user ask for it either. I think DROP PARTITION would still make sense without ADD PARTITION, as the latter would be a no-op in the Iceberg world.
>
> I gave this some thought, and even though the concept of partitioning is not aligned with a command like DROP PARTITION, I still see a rationale to implement it anyway. There are always going to be users coming from the Hive-table world, it has some safety nets, and (even though I have no contributions in Spark or Iceberg-Spark) this seems to be an isolated feature that has no risk of causing regressions in the existing code. Partition evolution is something that has to be given some extra thought with respect to DROP PARTITION, as the Hive world didn't have that, but if we can reach a consensus on that, I feel this addition has added value.
>
> Not sure I know what it means to have a use-case-specific implementation instead of having it in e.g. Iceberg-Spark.
>
> Have a nice weekend!
> Gabor
>
> On Mon, Jul 22, 2024 at 7:05 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi Walaa
>>
>> It makes sense, thanks for pointing out the use case.
>>
>> I agree that it's better to consider a use-case-specific impl.
>>
>> Regards
>> JB
>>
>> On Wed, Jul 17, 2024 at 11:36 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >
>> > Hi Jean, one use case is Hive-to-Iceberg migration, where DROP PARTITION does not need to change to DELETE queries prior to the migration.
>> >
>> > That said, I am not in favor of adding this to Iceberg directly (or Iceberg-Spark) due to the reasons Jean mentioned. It might be possible to do it in a custom extension or custom connector outside Iceberg that is specific to the use case (e.g., the migration use case I mentioned above).
>> >
>> > Further, as Szehon said, it would not make sense without ADD PARTITION. However, ADD PARTITION requires a spec change (since Iceberg does not support empty partitions but ADD PARTITION does).
>> >
>> > So overall I am -1 to DROP PARTITION in the Iceberg default implementation, and I think it is better to consider implementing it in a use-case-specific implementation.
>> >
>> > Thanks,
>> > Walaa.
>> >
>> >
>> > On Wed, Jul 17, 2024 at 12:34 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>
>> >> Hi Gabor
>> >>
>> >> Do you have user requests for that? Iceberg produces partitions by taking column values (optionally with a transform function), so the hidden partitioning doesn't require user actions. I wonder about the use cases for dynamic partitioning (using ADD/DROP). Is it more for partition maintenance?
>> >>
>> >> Thanks!
>> >> Regards
>> >> JB
>> >>
>> >> On Wed, Jul 17, 2024 at 11:11 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>> >> >
>> >> > Hey Community,
>> >> >
>> >> > I learned recently that Spark doesn't support DROP PARTITION for Iceberg tables. I understand this is because DROP PARTITION is something used for Hive tables, and Iceberg's model of hidden partitioning makes it unnatural to have commands like this.
>> >> >
>> >> > However, I think that DROP PARTITION would still have some value for users. In fact, in Impala we implemented this even for Iceberg tables. Benefits could be:
>> >> > - Users with workloads on Hive tables could keep using those workloads after migrating their tables to Iceberg.
>> >> > - As opposed to DELETE FROM, DROP PARTITION has a guarantee that it is a metadata-only operation and no delete files are going to be written.
>> >> >
>> >> > I'm curious what the community thinks of this.
>> >> > Gabor
>> >> >

--
Ryan Blue
Databricks