There's a potential solution that's similar to what Xianjin suggested. Rather than adding a new SQL keyword (which is a lot of work and specific to Iceberg), we would instead add support for pushing down `CAST` expressions from Spark. That way you could use filters like `DELETE FROM table WHERE cast(ts AS date) = '2024-08-02'`, and the alignment with partitions would happen automatically. This would also be generally useful for easily selecting all the data for a given date.
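To make the alignment idea concrete, here is a minimal sketch in plain Python (not Iceberg or Spark APIs; the `days_transform` helper and the sample rows are illustrative assumptions). It shows why a predicate of the form `cast(ts AS date) = <day>` lines up exactly with a `days`-partitioned table, so the matching delete can drop whole data files instead of writing positional deletes:

```python
from datetime import datetime, date

def days_transform(ts: datetime) -> date:
    """Iceberg-style 'days' partition transform: maps a timestamp to its date.

    (Illustrative stand-in, not the Iceberg implementation.)
    """
    return ts.date()

# Hypothetical sample rows in a table partitioned by days(ts).
rows = [
    datetime(2024, 8, 2, 6, 32),
    datetime(2024, 8, 2, 20, 34),
    datetime(2024, 8, 3, 9, 0),
]

# Group rows into partitions the way a days-partitioned table would.
partitions: dict[date, list[datetime]] = {}
for ts in rows:
    partitions.setdefault(days_transform(ts), []).append(ts)

# The filter cast(ts AS date) = '2024-08-02' selects exactly the rows of the
# 2024-08-02 partition -- no partition is split by the predicate, which is
# what makes a metadata-only delete possible.
matched = [ts for ts in rows if days_transform(ts) == date(2024, 8, 2)]
assert matched == partitions[date(2024, 8, 2)]
```

A filter like `ts >= '2024-08-02 06:00:00'`, by contrast, would cut through the middle of a partition, forcing delete files to be written.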
On Fri, Aug 2, 2024 at 6:32 AM Xianjin YE <xian...@apache.org> wrote:
>
> b) they have a concern that with getting the WHERE filter of the DELETE not aligned with partition boundaries they might end up having pos-deletes that could have an impact on their read perf
>
> I think this is a legit concern, and currently `DELETE FROM` cannot guarantee that. It would be valuable to have a way to enforce it.
>
> Like others already pointed out, the concept of partitioning is not aligned between Iceberg and Hive; adding `DROP PARTITION` directly might confuse people if they have changed the partition spec of the table.
>
> How about we add a new SQL syntax to express the semantics of delete/drop partition only? Something like:
>
> ```
> DELETE PARTITION FROM table WHERE delete_filter
> ```
>
> The SQL will only delete partitions if it's a metadata-only operation.
>
> On Aug 2, 2024, at 20:34, Gabor Kaszab <gaborkas...@cloudera.com.INVALID> wrote:
>
> Hey Everyone,
>
> Thanks for the responses, and sorry for the long delay in mine. Let me try to answer the questions that came up.
>
> Yes, this has been an ask from a specific user who finds the lack of DROP PARTITION a blocker for migrating to Iceberg from Hive tables. I know, our initial response was also to use DELETE FROM instead, but a) there are users who have grown so big that it's nearly impossible to educate them, and b) they have a concern that with the WHERE filter of the DELETE not aligned with partition boundaries they might end up having pos-deletes that could have an impact on their read perf. So they find it very crucial to have a guarantee that when they try to drop data within a partition, it's either a metadata-only operation or it fails.
>
> About ADD PARTITION: I agree it wouldn't make sense for Iceberg, but fortunately there is no user ask for it either. I think DROP PARTITION would still make sense without ADD PARTITION, as the latter would be a no-op in the Iceberg world.
>
> I gave this some thought, and even though the concept of partitioning is not aligned with a command like DROP PARTITION, I still see a rationale to implement it anyway. There are always going to be users coming from the Hive-table world, it has some safety nets, and (even though I have no contributions in Spark or Iceberg-Spark) this seems to be an isolated feature that has no risk of causing regressions in the existing code. Partition evolution is something that has to be given some extra thought with respect to DROP PARTITION, as the Hive world didn't have that, but if we can reach a consensus on that, I feel this addition has added value.
>
> Not sure I know what it means to have a use-case-specific implementation instead of having it in e.g. Iceberg-Spark.
>
> Have a nice weekend!
> Gabor
>
> On Mon, Jul 22, 2024 at 7:05 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi Walaa
>>
>> It makes sense, thanks for pointing out the use case.
>>
>> I agree that it's better to consider a use-case-specific impl.
>>
>> Regards
>> JB
>>
>> On Wed, Jul 17, 2024 at 11:36 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >
>> > Hi Jean, one use case is Hive-to-Iceberg migration, where DROP PARTITION does not need to change to DELETE queries prior to the migration.
>> >
>> > That said, I am not in favor of adding this to Iceberg directly (or Iceberg-Spark) due to the reasons Jean mentioned. It might be possible to do it in a custom extension or custom connector outside Iceberg that is specific to the use case (e.g., the migration use case I mentioned above).
>> >
>> > Further, as Szehon said, it would not make sense without ADD PARTITION. However, ADD PARTITION requires a spec change (since Iceberg does not support empty partitions but ADD PARTITION does).
>> >
>> > So overall I am -1 to DROP PARTITION in the Iceberg default implementation, and I think it is better to consider implementing it in a use-case-specific implementation.
>> >
>> > Thanks,
>> > Walaa.
>> >
>> >
>> > On Wed, Jul 17, 2024 at 12:34 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>
>> >> Hi Gabor
>> >>
>> >> Do you have user requests for that? Iceberg produces partitions by taking column values (optionally with a transform function), so the hidden partitioning doesn't require user actions. I wonder about the use cases for dynamic partitioning (using ADD/DROP). Is it more for partition maintenance?
>> >>
>> >> Thanks!
>> >> Regards
>> >> JB
>> >>
>> >> On Wed, Jul 17, 2024 at 11:11 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>> >> >
>> >> > Hey Community,
>> >> >
>> >> > I learned recently that Spark doesn't support DROP PARTITION for Iceberg tables. I understand this is because DROP PARTITION is something used for Hive tables, and Iceberg's model of hidden partitioning makes it unnatural to have commands like this.
>> >> >
>> >> > However, I think that DROP PARTITION would still have some value for users. In fact, in Impala we implemented this even for Iceberg tables. Benefits could be:
>> >> > - Users with workloads on Hive tables could keep using those workloads after migrating their tables to Iceberg.
>> >> > - As opposed to DELETE FROM, DROP PARTITION has a guarantee that it is a metadata-only operation and no delete files are going to be written.
>> >> >
>> >> > I'm curious what the community thinks of this.
>> >> > Gabor
>> >> >

--
Ryan Blue
Databricks