Sure, the PRs are https://github.com/apache/spark/pull/44119 (merge) and https://github.com/apache/spark/pull/47233 (update); delete is in progress.

Thanks,
Szehon
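For readers who want a picture of what the DataFrame-side MERGE might look like, here is a minimal Scala sketch. It is based on my reading of the first PR above; the method names (mergeInto, whenMatched, whenNotMatched, merge), the column resolution, and the table names "source" and "target" are all assumptions for illustration, not an interface confirmed by this thread.

    import org.apache.spark.sql.functions.col

    // "source" and "target" are hypothetical table names used only for illustration.
    val source = spark.table("source")

    source
      .mergeInto("target", col("source.id") === col("target.id")) // join condition between source and target
      .whenMatched()
      .updateAll()      // matched target rows take the source values
      .whenNotMatched()
      .insertAll()      // unmatched source rows are inserted
      .merge()          // executes the MERGE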
On Tue, Jul 9, 2024 at 10:27 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
> Hi Szehon,
> Thanks for the update.
> Can you please point me to the work on supporting DELETE/UPDATE/MERGE in the DataFrame API?
> Thanks,
> Wing Yew
>
> On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> Hi,
>>
>> Just FYI, good news: this change is merged on the Spark side: https://github.com/apache/spark/pull/46707 (it's the third effort!). In the next version of Spark, we will be able to pass read properties via SQL to a particular Iceberg table, such as:
>>
>>   SELECT * FROM iceberg.db.table1 WITH (`locality` = `true`)
>>
>> I will look at write options after this.
>>
>> There's also progress on supporting DELETE/UPDATE/MERGE from DataFrames; that should be coming soon in Spark as well.
>>
>> Thanks,
>> Szehon
>>
>> On Wed, Jul 26, 2023 at 12:46 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>> We are talking about DELETE/UPDATE/MERGE operations. There is only SQL support for these operations; there is no DataFrame API support for them.* Therefore write options are not applicable, and SQLConf is the only available mechanism I can use to override the table property.
>>> For reference, we currently support setting distribution mode using a write option, SQLConf and table property. It seems to me that https://github.com/apache/iceberg/pull/6838/ is a precedent for what I'd like to do.
>>>
>>> * It would be of interest to support performing DELETE/UPDATE/MERGE from DataFrames, but that is a whole other topic.
>>>
>>> On Wed, Jul 26, 2023 at 12:04 PM Ryan Blue <b...@tabular.io> wrote:
>>>> I think we should aim to have the same behavior across properties that are set in SQL conf, table config, and write options. Having SQL conf override table config for this doesn't make sense to me. If the need is to override table configuration, then write options are the right way to do it.
>>>>
>>>> On Wed, Jul 26, 2023 at 10:10 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>> I was on vacation.
>>>>> Currently, write modes (copy-on-write/merge-on-read) can only be set as table properties, and default to copy-on-write. We have a customer who wants to use copy-on-write for certain Spark jobs that write to some Iceberg table and merge-on-read for other Spark jobs writing to the same table, because of the write characteristics of those jobs. This seems like a use case that should be supported. The only way they can do this currently is to toggle the table property as needed before doing the writes. This is not a sustainable workaround.
>>>>> Hence, I think it would be useful to be able to configure the write mode as a SQLConf. I also disagree that the table property should always win; if that is the case, there is no way to override it. The existing behavior in SparkConfParser is to use the option if set, else the session conf if set, else the table property. This applies across the board.
>>>>> - Wing Yew
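To make the workaround Wing Yew describes concrete, this is roughly what the table-property toggling looks like today. The property names are the standard Iceberg write-mode table properties; `iceberg.db.table1` is a made-up table name.

    // Switch the table to merge-on-read before the jobs that need it...
    spark.sql("""
      ALTER TABLE iceberg.db.table1 SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
      )
    """)

    // ...and back to copy-on-write before the jobs that need that instead.
    spark.sql("""
      ALTER TABLE iceberg.db.table1 SET TBLPROPERTIES (
        'write.delete.mode' = 'copy-on-write',
        'write.update.mode' = 'copy-on-write',
        'write.merge.mode'  = 'copy-on-write'
      )
    """)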
>>>>> On Sun, Jul 16, 2023 at 4:48 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>> Yes, I agree that there is value for administrators in having some things exposed as Spark SQL configuration. That gets much harder when you want to use the SQLConf for table-level settings, though. For example, the target split size is something that was an engine setting in the Hadoop world, even though it makes no sense to use the same setting across vastly different tables --- think about joining a fact table with a dimension table.
>>>>>>
>>>>>> Settings like write mode are table-level settings. It matters what is downstream of the table. You may want to set a *default* write mode, but the table-level setting should always win. Currently, there are limits to overriding the write mode in SQL. That's why we should add hints. For anything beyond that, I think we need to discuss what you're trying to do. If it's to override a table-level setting with a SQL global, then we should understand the use case better.
>>>>>>
>>>>>> On Fri, Jul 14, 2023 at 6:09 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>>>> Also, in the case of write mode (I mean write.delete.mode, write.update.mode, write.merge.mode), these cannot be set as options currently; they are only settable as table properties.
>>>>>>>
>>>>>>> On Fri, Jul 14, 2023 at 5:58 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>>>>>>>> I think that different use cases benefit from, or even require, different solutions. Enabling options in Spark SQL is helpful, but allowing some configurations to be done in SQLConf is also helpful.
>>>>>>>> For Cheng Pan's use case (disabling locality), I think providing a conf (which can be added to spark-defaults.conf by a cluster admin) is useful.
>>>>>>>> For my customer's use case (https://github.com/apache/iceberg/pull/7790), being able to set the write mode per Spark job (where right now it can only be set as a table property) is useful. Allowing this to be done in SQL with an option/hint could also work, but as I understand it, Szehon's PR (https://github.com/apache/spark/pull/416830) is only applicable to reads, not writes.
>>>>>>>>
>>>>>>>> - Wing Yew
>>>>>>>>
>>>>>>>> On Thu, Jul 13, 2023 at 1:04 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>> Ryan, I understand that options should be job-specific, and that introducing an OPTIONS hint would give Spark SQL capabilities similar to what the DataFrame API has.
>>>>>>>>>
>>>>>>>>> My point is, some of the Iceberg options should not be job-specific.
>>>>>>>>>
>>>>>>>>> For example, Iceberg has an option "locality" which can only be set at the job level, while Spark has a configuration "spark.shuffle.reduceLocality.enabled" which can be set at the cluster level. This gap blocks Spark administrators from migrating to Iceberg, because they cannot disable locality at the cluster level.
>>>>>>>>>
>>>>>>>>> So, what is Iceberg's principle for classifying a configuration as a SQLConf or an OPTION?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
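For illustration of the gap Cheng Pan describes (the table name below is made up): Iceberg's locality knob is a per-read option, while the closest Spark-native knob is an ordinary configuration an administrator can put in spark-defaults.conf.

    // Per-job/per-read: Iceberg's "locality" read option must be set on each read.
    val df = spark.read
      .option("locality", "false")
      .table("iceberg.db.table1")

    // Cluster-wide: Spark's own locality knob can simply go in spark-defaults.conf,
    //   spark.shuffle.reduceLocality.enabled=false
    // but there is no equivalent cluster-level switch for the Iceberg option above.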
>>>>>>>>> > On Jul 5, 2023, at 16:26, Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > I would argue that the SQLConf way is more in line with Spark user/administrator habits.
>>>>>>>>> >
>>>>>>>>> > It's common practice for Spark administrators to set configurations in spark-defaults.conf at the cluster level, and when users have issues with their Spark SQL/jobs, the first question they usually ask is: can it be fixed by adding a Spark configuration?
>>>>>>>>> >
>>>>>>>>> > The OPTIONS way requires additional learning effort from Spark users, and how would Spark administrators set options at the cluster level?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Cheng Pan
>>>>>>>>> >
>>>>>>>>> >> On Jun 17, 2023, at 04:01, Wing Yew Poon <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hi,
>>>>>>>>> >> I recently put up a PR, https://github.com/apache/iceberg/pull/7790, to allow the write mode (copy-on-write/merge-on-read) to be specified in SQLConf. The use case is explained in the PR.
>>>>>>>>> >> Cheng Pan has an open PR, https://github.com/apache/iceberg/pull/7733, to allow locality to be specified in SQLConf.
>>>>>>>>> >> In the recent past, https://github.com/apache/iceberg/pull/6838/ was a PR to allow the write distribution mode to be specified in SQLConf. This was merged.
>>>>>>>>> >> Cheng Pan asks if there is any guidance on when we should allow configs to be specified in SQLConf.
>>>>>>>>> >> Thanks,
>>>>>>>>> >> Wing Yew
>>>>>>>>> >>
>>>>>>>>> >> P.S. The above open PRs could use reviews by committers.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
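As a footnote on the distribution-mode precedent mentioned above: after https://github.com/apache/iceberg/pull/6838/ it can be set at all three levels, with the option-over-conf-over-property precedence Wing Yew describes. A rough Scala sketch follows; the table and DataFrame names are made up, and the session-conf key is my reading of that PR, so verify it before relying on it.

    // 1) Table property (the baseline):
    spark.sql("ALTER TABLE iceberg.db.table1 SET TBLPROPERTIES ('write.distribution-mode' = 'none')")

    // 2) Session conf (overrides the table property; key taken from my reading of PR #6838):
    spark.conf.set("spark.sql.iceberg.distribution-mode", "hash")

    // 3) Write option (overrides both, per SparkConfParser's option > session conf > table property order):
    df.writeTo("iceberg.db.table1")
      .option("distribution-mode", "range")
      .append()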