Hi Szehon,
Thanks for the update.
Can you please point me to the work on supporting DELETE/UPDATE/MERGE in
the DataFrame API?
Thanks,
Wing Yew


On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi,
>
> Just FYI, good news: this change is merged on the Spark side:
> https://github.com/apache/spark/pull/46707 (it's the third effort!). In the
> next version of Spark, we will be able to pass read properties via SQL to a
> particular Iceberg table, such as
>
> SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)
>
> I will look at write options after this.
>
> There's also progress in supporting DELETE/UPDATE/MERGE from DataFrames;
> that should be coming soon in Spark as well.
>
> Thanks,
> Szehon
>
>
>
> On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> We are talking about DELETE/UPDATE/MERGE operations. These operations are
>> supported only in SQL; there is no DataFrame API support for them.*
>> Therefore write options are not applicable, and SQLConf is the only
>> available mechanism I can use to override the table property.
>> For reference, we currently support setting distribution mode using write
>> option, SQLConf and table property. It seems to me that
>> https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd
>> like to do.
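
For illustration, the distribution-mode precedent covers two of those three
levels directly in SQL (a sketch; the table name is hypothetical, and the
session conf name is my understanding of what PR #6838 added):

```sql
-- Table property: persisted with the table, lowest precedence
ALTER TABLE db.events SET TBLPROPERTIES ('write.distribution-mode' = 'hash');

-- Session conf: overrides the table property for this session
SET spark.sql.iceberg.distribution-mode = none;
```

The third level, a per-write option, is only reachable from the DataFrame
write API, which is exactly why it does not help the SQL-only operations
above.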
>>
>> * It would be of interest to support performing DELETE/UPDATE/MERGE from
>> DataFrames, but that is a whole other topic.
>>
>>
>> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I think we should aim to have the same behavior across properties that
>>> are set in SQL conf, table config, and write options. Having SQL conf
>>> override table config for this doesn't make sense to me. If the need is to
>>> override table configuration, then write options are the right way to do it.
>>>
>>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> I was on vacation.
>>>> Currently, write modes (copy-on-write/merge-on-read) can only be set as
>>>> table properties, and default to copy-on-write. We have a customer who
>>>> wants to use copy-on-write for certain Spark jobs that write to some
>>>> Iceberg table and merge-on-read for other Spark jobs writing to the same
>>>> table, because of the write characteristics of those jobs. This seems like
>>>> a use case that should be supported. The only way they can do this
>>>> currently is to toggle the table property as needed before doing the
>>>> writes. This is not a sustainable workaround.
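
The toggling workaround looks roughly like this (a sketch; `db.events` is a
hypothetical table, and `write.merge.mode` is the standard Iceberg table
property):

```sql
-- Before the jobs that should use merge-on-read:
ALTER TABLE db.events SET TBLPROPERTIES ('write.merge.mode' = 'merge-on-read');
-- ... run those jobs ...

-- Toggle back before the jobs that should use copy-on-write:
ALTER TABLE db.events SET TBLPROPERTIES ('write.merge.mode' = 'copy-on-write');
```

Because the property is shared table state, jobs that run concurrently with
different needs race on it, which is part of why this is not sustainable.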
>>>> Hence, I think it would be useful to be able to configure the write
>>>> mode as a SQLConf. I also disagree that the table property should always
>>>> win; if it always wins, there is no way to override it. The existing
>>>> behavior in SparkConfParser is to use the option if set, else use the
>>>> session conf if set, else use the table property. This applies across the
>>>> board.
>>>> - Wing Yew
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Yes, I agree that there is value for administrators from having some
>>>>> things exposed as Spark SQL configuration. That gets much harder when you
>>>>> want to use the SQLConf for table-level settings, though. For example, the
>>>>> target split size is something that was an engine setting in the Hadoop
>>>>> world, even though it makes no sense to use the same setting across vastly
>>>>> different tables --- think about joining a fact table with a dimension
>>>>> table.
>>>>>
>>>>> Settings like write mode are table-level settings. It matters what is
>>>>> downstream of the table. You may want to set a *default* write mode, but
>>>>> the table-level setting should always win. Currently, there are limits to
>>>>> overriding the write mode in SQL. That's why we should add hints. For
>>>>> anything beyond that, I think we need to discuss what you're trying to do.
>>>>> If it's to override a table-level setting with a SQL global, then we
>>>>> should understand the use case better.
>>>>>
>>>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon
>>>>> <wyp...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> Also, in the case of write mode (I mean write.delete.mode,
>>>>>> write.update.mode, write.merge.mode), these cannot be set as options
>>>>>> currently; they are only settable as table properties.
>>>>>>
>>>>>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon <wyp...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think that different use cases benefit from or even require
>>>>>>> different solutions. Enabling options in Spark SQL is helpful, but
>>>>>>> allowing some configurations to be done in SQLConf is also helpful.
>>>>>>> For Cheng Pan's use case (to disable locality), I think providing a
>>>>>>> conf (which can be added to spark-defaults.conf by a cluster admin) is
>>>>>>> useful.
>>>>>>> For my customer's use case (
>>>>>>> https://github.com/apache/iceberg/pull/7790), being able to set the
>>>>>>> write mode per Spark job (where right now it can only be set as a table
>>>>>>> property) is useful. Allowing this to be done in the SQL with an
>>>>>>> option/hint could also work, but as I understand it, Szehon's PR (
>>>>>>> https://github.com/apache/spark/pull/416830) is only applicable to
>>>>>>> reads, not writes.
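
Under a SQLConf approach like the one in PR #7790, the per-job flow would
look something like this (the conf name below is illustrative, not
necessarily the one the PR uses; the table names are hypothetical):

```sql
-- Ask this job to use merge-on-read, overriding the table default:
SET spark.sql.iceberg.merge.mode = merge-on-read;

MERGE INTO db.events t
USING db.updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```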
>>>>>>>
>>>>>>> - Wing Yew
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ryan, I understand that options should be job-specific, and
>>>>>>>> introducing an OPTIONS hint can give Spark SQL capabilities similar
>>>>>>>> to what the DataFrame API offers.
>>>>>>>>
>>>>>>>> My point is, some of the Iceberg options should not be job-specific.
>>>>>>>>
>>>>>>>> For example, Iceberg has an option “locality” which can only be set
>>>>>>>> at the job level, but Spark has a configuration
>>>>>>>> “spark.shuffle.reduceLocality.enabled” which can be set at the
>>>>>>>> cluster level. This gap blocks Spark administrators from migrating
>>>>>>>> to Iceberg, because they cannot disable locality at the cluster
>>>>>>>> level.
>>>>>>>>
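
To make the gap concrete: the Spark shuffle setting can be turned off once
for the whole cluster, while Iceberg's locality option has to be supplied
job by job (a sketch; the table name is hypothetical, and the per-query
syntax is the one discussed at the top of this thread):

```sql
-- Cluster level, in spark-defaults.conf (no per-query repetition needed):
--   spark.shuffle.reduceLocality.enabled=false

-- Iceberg's "locality" option, by contrast, must be passed per query:
SELECT * FROM iceberg.db.events WITH (`locality` = `false`)
```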
>>>>>>>> So, what’s the principle in Iceberg for classifying a
>>>>>>>> configuration as a SQLConf or an OPTION?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Cheng Pan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Jul 5, 2023, at 16:26, Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > I would argue that the SQLConf way is more in line with Spark
>>>>>>>> user/administrator habits.
>>>>>>>> >
>>>>>>>> > It’s a common practice that Spark administrators set
>>>>>>>> configurations in spark-defaults.conf at the cluster level, and when
>>>>>>>> users have issues with their Spark SQL/jobs, the first question they
>>>>>>>> ask is usually: can it be fixed by adding a Spark configuration?
>>>>>>>> >
>>>>>>>> > The OPTIONS way adds a learning burden for Spark users, and how
>>>>>>>> can Spark administrators set options at the cluster level?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Cheng Pan
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >> On Jun 17, 2023, at 04:01, Wing Yew Poon
>>>>>>>> <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>> >>
>>>>>>>> >> Hi,
>>>>>>>> >> I recently put up a PR,
>>>>>>>> https://github.com/apache/iceberg/pull/7790, to allow the write
>>>>>>>> mode (copy-on-write/merge-on-read) to be specified in SQLConf. The use 
>>>>>>>> case
>>>>>>>> is explained in the PR.
>>>>>>>> >> Cheng Pan has an open PR,
>>>>>>>> https://github.com/apache/iceberg/pull/7733, to allow locality to
>>>>>>>> be specified in SQLConf.
>>>>>>>> >> In the recent past, https://github.com/apache/iceberg/pull/6838/
>>>>>>>> was a PR to allow the write distribution mode to be specified in 
>>>>>>>> SQLConf.
>>>>>>>> This was merged.
>>>>>>>> >> Cheng Pan asks if there is any guidance on when we should allow
>>>>>>>> configs to be specified in SQLConf.
>>>>>>>> >> Thanks,
>>>>>>>> >> Wing Yew
>>>>>>>> >>
>>>>>>>> >> ps. The above open PRs could use reviews by committers.
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
