Sure, the PRs are https://github.com/apache/spark/pull/44119 (merge) and https://github.com/apache/spark/pull/47233 (update); delete is in progress.

Thanks,
Szehon
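For readers who want a picture of what the DataFrame-side MERGE might look like, here is a minimal Scala sketch. It is based on my reading of the first PR above; the method names (mergeInto, whenMatched, whenNotMatched, merge), the column resolution, and the table names "source" and "target" are all assumptions for illustration, not an interface confirmed by this thread.

    import org.apache.spark.sql.functions.col

    // "source" and "target" are hypothetical table names used only for illustration.
    val source = spark.table("source")

    source
      .mergeInto("target", col("source.id") === col("target.id")) // join condition between source and target
      .whenMatched()
      .updateAll()      // matched target rows take the source values
      .whenNotMatched()
      .insertAll()      // unmatched source rows are inserted
      .merge()          // executes the MERGE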
On Tue, Jul 9, 2024 at 10:27 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
> Hi Szehon,
> Thanks for the update.
> Can you please point me to the work on supporting DELETE/UPDATE/MERGE in the DataFrame API?
> Thanks,
> Wing Yew
>
> On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> Hi,
>>
>> Just FYI, good news: this change is merged on the Spark side: https://github.com/apache/spark/pull/46707 (it's the third effort!). In the next version of Spark, we will be able to pass read properties via SQL to a particular Iceberg table, such as:
>>
>>   SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)
>>
>> I will look at write options after this.
>>
>> There's also progress on supporting DELETE/UPDATE/MERGE from DataFrames; that should be coming soon in Spark as well.
>>
>> Thanks,
>> Szehon
>>
>> On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>> We are talking about DELETE/UPDATE/MERGE operations. There is only SQL support for these operations; there is no DataFrame API support for them.* Therefore write options are not applicable, and SQLConf is the only available mechanism I can use to override the table property.
>>> For reference, we currently support setting distribution mode using a write option, SQLConf and table property. It seems to me that https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd like to do.
>>>
>>> * It would be of interest to support performing DELETE/UPDATE/MERGE from DataFrames, but that is a whole other topic.
>>>
>>> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue <b...@tabular.io> wrote:
>>>> I think we should aim to have the same behavior across properties that are set in SQL conf, table config, and write options. Having SQL conf override table config for this doesn't make sense to me. If the need is to override table configuration, then write options are the right way to do it.
>>>>
>>>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>> I was on vacation.
>>>>> Currently, write modes (copy-on-write/merge-on-read) can only be set as table properties, and default to copy-on-write. We have a customer who wants to use copy-on-write for certain Spark jobs that write to some Iceberg table and merge-on-read for other Spark jobs writing to the same table, because of the write characteristics of those jobs. This seems like a use case that should be supported. The only way they can do this currently is to toggle the table property as needed before doing the writes. This is not a sustainable workaround.
>>>>> Hence, I think it would be useful to be able to configure the write mode as a SQLConf. I also disagree that the table property should always win; if that is the case, there is no way to override it. The existing behavior in SparkConfParser is to use the option if set, else the session conf if set, else the table property. This applies across the board.
>>>>> - Wing Yew
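To make the workaround Wing Yew describes concrete, this is roughly what the table-property toggling looks like today. The property names are the standard Iceberg write-mode table properties; `iceberg.db.table1` is a made-up table name.

    // Switch the table to merge-on-read before the jobs that need it...
    spark.sql("""
      ALTER TABLE iceberg.db.table1 SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
      )
    """)

    // ...and back to copy-on-write before the jobs that need that instead.
    spark.sql("""
      ALTER TABLE iceberg.db.table1 SET TBLPROPERTIES (
        'write.delete.mode' = 'copy-on-write',
        'write.update.mode' = 'copy-on-write',
        'write.merge.mode'  = 'copy-on-write'
      )
    """)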
>>>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>> Yes, I agree that there is value for administrators in having some things exposed as Spark SQL configuration. That gets much harder when you want to use the SQLConf for table-level settings, though. For example, the target split size is something that was an engine setting in the Hadoop world, even though it makes no sense to use the same setting across vastly different tables --- think about joining a fact table with a dimension table.
>>>>>>
>>>>>> Settings like write mode are table-level settings. It matters what is downstream of the table. You may want to set a *default* write mode, but the table-level setting should always win. Currently, there are limits to overriding the write mode in SQL. That's why we should add hints. For anything beyond that, I think we need to discuss what you're trying to do. If it's to override a table-level setting with a SQL global, then we should understand the use case better.
>>>>>>
>>>>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>>>> Also, in the case of write mode (I mean write.delete.mode, write.update.mode, write.merge.mode), these cannot be set as options currently; they are only settable as table properties.
>>>>>>>
>>>>>>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>>>>>>>> I think that different use cases benefit from, or even require, different solutions. Enabling options in Spark SQL is helpful, but allowing some configurations to be done in SQLConf is also helpful.
>>>>>>>> For Cheng Pan's use case (disabling locality), I think providing a conf (which can be added to spark-defaults.conf by a cluster admin) is useful.
>>>>>>>> For my customer's use case (https://github.com/apache/iceberg/pull/7790), being able to set the write mode per Spark job (where right now it can only be set as a table property) is useful. Allowing this to be done in SQL with an option/hint could also work, but as I understand it, Szehon's PR (https://github.com/apache/spark/pull/416830) is only applicable to reads, not writes.
>>>>>>>>
>>>>>>>> - Wing Yew
>>>>>>>>
>>>>>>>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>> Ryan, I understand that options should be job-specific, and that introducing an OPTIONS hint would give Spark SQL capabilities similar to what the DataFrame API has.
>>>>>>>>>
>>>>>>>>> My point is, some of the Iceberg options should not be job-specific.
>>>>>>>>>
>>>>>>>>> For example, Iceberg has an option "locality" which can only be set at the job level, while Spark has a configuration "spark.shuffle.reduceLocality.enabled" which can be set at the cluster level. This gap blocks Spark administrators from migrating to Iceberg, because they cannot disable locality at the cluster level.
>>>>>>>>>
>>>>>>>>> So, what is Iceberg's principle for classifying a configuration as a SQLConf or an OPTION?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
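For illustration of the gap Cheng Pan describes (the table name below is made up): Iceberg's locality knob is a per-read option, while the closest Spark-native knob is an ordinary configuration an administrator can put in spark-defaults.conf.

    // Per-job/per-read: Iceberg's "locality" read option must be set on each read.
    val df = spark.read
      .option("locality", "false")
      .table("iceberg.db.table1")

    // Cluster-wide: Spark's own locality knob can simply go in spark-defaults.conf,
    //   spark.shuffle.reduceLocality.enabled=false
    // but there is no equivalent cluster-level switch for the Iceberg option above.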
>>>>>>>>> > On Jul 5, 2023, at 16:26, Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > I would argue that the SQLConf way is more in line with Spark user/administrator habits.
>>>>>>>>> >
>>>>>>>>> > It's common practice for Spark administrators to set configurations in spark-defaults.conf at the cluster level, and when users have issues with their Spark SQL/jobs, the first question they usually ask is: can it be fixed by adding a Spark configuration?
>>>>>>>>> >
>>>>>>>>> > The OPTIONS way requires additional learning effort from Spark users, and how would Spark administrators set options at the cluster level?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Cheng Pan
>>>>>>>>> >
>>>>>>>>> >> On Jun 17, 2023, at 04:01, Wing Yew Poon <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hi,
>>>>>>>>> >> I recently put up a PR, https://github.com/apache/iceberg/pull/7790, to allow the write mode (copy-on-write/merge-on-read) to be specified in SQLConf. The use case is explained in the PR.
>>>>>>>>> >> Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733, to allow locality to be specified in SQLConf.
>>>>>>>>> >> In the recent past, https://github.com/apache/iceberg/pull/6838/ was a PR to allow the write distribution mode to be specified in SQLConf. This was merged.
>>>>>>>>> >> Cheng Pan asks if there is any guidance on when we should allow configs to be specified in SQLConf.
>>>>>>>>> >> Thanks,
>>>>>>>>> >> Wing Yew
>>>>>>>>> >>
>>>>>>>>> >> P.S. The above open PRs could use reviews by committers.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
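As a footnote on the distribution-mode precedent mentioned above: after https://github.com/apache/iceberg/pull/6838/ it can be set at all three levels, with the option-over-conf-over-property precedence Wing Yew describes. A rough Scala sketch follows; the table and DataFrame names are made up, and the session-conf key is my reading of that PR, so verify it before relying on it.

    // 1) Table property (the baseline):
    spark.sql("ALTER TABLE iceberg.db.table1 SET TBLPROPERTIES ('write.distribution-mode' = 'none')")

    // 2) Session conf (overrides the table property; key taken from my reading of PR #6838):
    spark.conf.set("spark.sql.iceberg.distribution-mode", "hash")

    // 3) Write option (overrides both, per SparkConfParser's option > session conf > table property order):
    df.writeTo("iceberg.db.table1")
      .option("distribution-mode", "range")
      .append()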