Hi Szehon,

Thanks for the update. Can you please point me to the work on supporting
DELETE/UPDATE/MERGE in the DataFrame API?

Thanks,
Wing Yew

On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi,
>
> Just FYI, good news: this change is merged on the Spark side:
> https://github.com/apache/spark/pull/46707 (it's the third effort!). In
> the next version of Spark, we will be able to pass read properties via
> SQL to a particular Iceberg table, such as
>
> SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)
>
> I will look at write options after this.
>
> There's also progress in supporting DELETE/UPDATE/MERGE from DataFrames
> as well; it should also be coming soon in Spark.
>
> Thanks,
> Szehon
>
> On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon
> <wyp...@cloudera.com.invalid> wrote:
>
>> We are talking about DELETE/UPDATE/MERGE operations. There is only SQL
>> support for these operations. There is no DataFrame API support for
>> them.* Therefore write options are not applicable. Thus SQLConf is the
>> only available mechanism I can use to override the table property.
>> For reference, we currently support setting distribution mode using
>> write option, SQLConf and table property. It seems to me that
>> https://github.com/apache/iceberg/pull/6838/ is a precedent for what
>> I'd like to do.
>>
>> * It would be of interest to support performing DELETE/UPDATE/MERGE
>> from DataFrames, but that is a whole other topic.
>>
>> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I think we should aim to have the same behavior across properties
>>> that are set in SQL conf, table config, and write options. Having SQL
>>> conf override table config for this doesn't make sense to me. If the
>>> need is to override table configuration, then write options are the
>>> right way to do it.
>>>
>>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> I was on vacation.
>>>> Currently, write modes (copy-on-write/merge-on-read) can only be set
>>>> as table properties, and default to copy-on-write. We have a
>>>> customer who wants to use copy-on-write for certain Spark jobs that
>>>> write to some Iceberg table and merge-on-read for other Spark jobs
>>>> writing to the same table, because of the write characteristics of
>>>> those jobs. This seems like a use case that should be supported. The
>>>> only way they can do this currently is to toggle the table property
>>>> as needed before doing the writes. This is not a sustainable
>>>> workaround.
>>>> Hence, I think it would be useful to be able to configure the write
>>>> mode as a SQLConf. I also disagree that the table property should
>>>> always win. If this is the case, there is no way to override it. The
>>>> existing behavior in SparkConfParser is to use the option if set,
>>>> else use the session conf if set, else use the table property. This
>>>> applies across the board.
>>>> - Wing Yew
>>>>
>>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Yes, I agree that there is value for administrators from having
>>>>> some things exposed as Spark SQL configuration. That gets much
>>>>> harder when you want to use the SQLConf for table-level settings,
>>>>> though. For example, the target split size is something that was an
>>>>> engine setting in the Hadoop world, even though it makes no sense
>>>>> to use the same setting across vastly different tables --- think
>>>>> about joining a fact table with a dimension table.
>>>>>
>>>>> Settings like write mode are table-level settings. It matters what
>>>>> is downstream of the table. You may want to set a *default* write
>>>>> mode, but the table-level setting should always win. Currently,
>>>>> there are limits to overriding the write mode in SQL. That's why we
>>>>> should add hints. For anything beyond that, I think we need to
>>>>> discuss what you're trying to do. If it's to override a table-level
>>>>> setting with a SQL global, then we should understand the use case
>>>>> better.
>>>>>
>>>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon
>>>>> <wyp...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> Also, in the case of write mode (I mean write.delete.mode,
>>>>>> write.update.mode, write.merge.mode), these cannot be set as
>>>>>> options currently; they are only settable as table properties.
>>>>>>
>>>>>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon <wyp...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think that different use cases benefit from or even require
>>>>>>> different solutions. I think enabling options in Spark SQL is
>>>>>>> helpful, but allowing some configurations to be done in SQLConf
>>>>>>> is also helpful.
>>>>>>> For Cheng Pan's use case (to disable locality), I think providing
>>>>>>> a conf (which can be added to spark-defaults.conf by a cluster
>>>>>>> admin) is useful.
>>>>>>> For my customer's use case
>>>>>>> (https://github.com/apache/iceberg/pull/7790), being able to set
>>>>>>> the write mode per Spark job (where right now it can only be set
>>>>>>> as a table property) is useful. Allowing this to be done in the
>>>>>>> SQL with an option/hint could also work, but as I understand it,
>>>>>>> Szehon's PR (https://github.com/apache/spark/pull/416830) is only
>>>>>>> applicable to reads, not writes.
>>>>>>>
>>>>>>> - Wing Yew
>>>>>>>
>>>>>>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan <pan3...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ryan, I understand that options should be job-specific, and that
>>>>>>>> introducing an OPTIONS HINT can give Spark SQL capabilities
>>>>>>>> similar to those of the DataFrame API.
>>>>>>>>
>>>>>>>> My point is, some of the Iceberg options should not be
>>>>>>>> job-specific.
>>>>>>>>
>>>>>>>> For example, Iceberg has an option “locality” that can only be
>>>>>>>> set at the job level, whereas Spark has a configuration
>>>>>>>> “spark.shuffle.reduceLocality.enabled” that can be set at the
>>>>>>>> cluster level. This gap blocks Spark administrators from
>>>>>>>> migrating to Iceberg, because they cannot disable locality at
>>>>>>>> the cluster level.
>>>>>>>>
>>>>>>>> So, what is the principle in Iceberg for classifying a
>>>>>>>> configuration as a SQLConf or an OPTION?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Cheng Pan
>>>>>>>>
>>>>>>>> > On Jul 5, 2023, at 16:26, Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > I would argue that the SQLConf way is more in line with Spark
>>>>>>>> > user/administrator habits.
>>>>>>>> >
>>>>>>>> > It’s common practice for Spark administrators to set
>>>>>>>> > configurations in spark-defaults.conf at the cluster level,
>>>>>>>> > and when users have issues with their Spark SQL/jobs, the
>>>>>>>> > first question they usually ask is: can it be fixed by adding
>>>>>>>> > a Spark configuration?
>>>>>>>> >
>>>>>>>> > The OPTIONS way adds learning effort for Spark users, and how
>>>>>>>> > can Spark administrators set options at the cluster level?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Cheng Pan
>>>>>>>> >
>>>>>>>> >> On Jun 17, 2023, at 04:01, Wing Yew Poon
>>>>>>>> >> <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>> >>
>>>>>>>> >> Hi,
>>>>>>>> >> I recently put up a PR,
>>>>>>>> >> https://github.com/apache/iceberg/pull/7790, to allow the
>>>>>>>> >> write mode (copy-on-write/merge-on-read) to be specified in
>>>>>>>> >> SQLConf. The use case is explained in the PR.
>>>>>>>> >> Cheng Pan has an open PR,
>>>>>>>> >> https://github.com/apache/iceberg/pull/7733, to allow
>>>>>>>> >> locality to be specified in SQLConf.
>>>>>>>> >> In the recent past,
>>>>>>>> >> https://github.com/apache/iceberg/pull/6838/ was a PR to
>>>>>>>> >> allow the write distribution mode to be specified in SQLConf.
>>>>>>>> >> This was merged.
>>>>>>>> >> Cheng Pan asks if there is any guidance on when we should
>>>>>>>> >> allow configs to be specified in SQLConf.
>>>>>>>> >> Thanks,
>>>>>>>> >> Wing Yew
>>>>>>>> >>
>>>>>>>> >> P.S. The above open PRs could use reviews by committers.
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
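
For readers following the thread, the lookup order Wing Yew describes for
SparkConfParser (use the write option if set, else the session conf if
set, else the table property) can be sketched roughly as below. This is
only an illustration in plain Python, not Iceberg's actual code: the
function name resolve_setting, the option key "merge-mode", and the
session conf key "spark.sql.iceberg.merge-mode" are hypothetical names
made up for the example. Only the table property write.merge.mode and the
two mode values come from the thread.

```python
from typing import Mapping, Optional


def resolve_setting(
    option_key: str,
    session_key: str,
    table_key: str,
    write_options: Mapping[str, str],
    session_conf: Mapping[str, str],
    table_properties: Mapping[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the value from the most specific source that sets it."""
    # 1. A per-write option wins if present.
    if option_key in write_options:
        return write_options[option_key]
    # 2. Otherwise fall back to the session-level conf.
    if session_key in session_conf:
        return session_conf[session_key]
    # 3. Otherwise fall back to the table property.
    if table_key in table_properties:
        return table_properties[table_key]
    # 4. Nothing is set anywhere: use the built-in default.
    return default


# Hypothetical key names, for illustration only.
OPTION_KEY = "merge-mode"
SESSION_KEY = "spark.sql.iceberg.merge-mode"
TABLE_KEY = "write.merge.mode"  # table property named in the thread

table_props = {TABLE_KEY: "copy-on-write"}
session = {SESSION_KEY: "merge-on-read"}

# A job that sets the session conf overrides the table property.
mode_with_session = resolve_setting(
    OPTION_KEY, SESSION_KEY, TABLE_KEY, {}, session, table_props, "copy-on-write"
)

# A job that sets nothing picks up the table property.
mode_table_only = resolve_setting(
    OPTION_KEY, SESSION_KEY, TABLE_KEY, {}, {}, table_props, "copy-on-write"
)
```

Under this ordering, two jobs writing to the same table can end up with
different write modes without touching the table property, which is the
customer use case raised above. Ryan's position (the table-level setting
should always win) would amount to checking the table property before the
session conf instead.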