Yes, an in-memory Hive catalog backed by a local Derby DB. And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance.
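A quick way to see which catalog implementation a session is actually running with is to read the relevant static conf and poke the catalog directly. A minimal, illustrative sketch (assuming a local build with the spark-hive module on the classpath and no external metastore configured, in which case Spark falls back to the embedded Derby-backed metastore mentioned above):

```scala
import org.apache.spark.sql.SparkSession

object CatalogCheck {
  def main(args: Array[String]): Unit = {
    // With enableHiveSupport() and no hive-site.xml, Spark uses an embedded,
    // Derby-backed metastore (the local metastore_db directory).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalog-check")
      .enableHiveSupport()
      .getOrCreate()

    // Static conf fixed at session start: expected "hive" here,
    // "in-memory" for a session without Hive support.
    println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

    // Directory used for managed tables.
    println(spark.conf.get("spark.sql.warehouse.dir"))

    // Pure metadata operations: these are answered by the catalog/metastore
    // during analysis and planning, before any table data is read.
    spark.catalog.listDatabases().show(truncate = false)
    spark.sql("SHOW TABLES").show()

    spark.stop()
  }
}
```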
Thanks,

On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> With regard to your point below:
>
> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>
> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas, and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>
> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>
> I don't understand this. Do you mean a Derby database?
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
> view my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Thanks for the detailed answer.
>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>> Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>
>> Thanks!
>>
>>
>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> My take regarding your question is that your mileage varies, so to speak.
>>>
>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premises), using Hive may offer better compatibility and interoperability.
>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>> 5) There seems to be some similarity between the Spark catalog and Databricks Unity Catalog, so that may favour the choice.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>> view my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>>
>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables, and why each should be used...
>>>>
>>>> Thanks
>>>> Nimrod
>>>>
>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I see a statement made as below, and I quote:
>>>>>
>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>
>>>>> Can you please elaborate on the above, specifically with regard to the phrase "... because we support better"?
>>>>>
>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to its integration with Spark?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>
>>>>>
>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kent Yao
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>> >
>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>> >
>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>> >
>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>> >
>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
>>>>>>> >
>>>>>>> > Historically, this behavior change was merged once during Apache Spark 3.0.0 preparation via SPARK-30098, but was reverted during the 3.0.0 RC period.
>>>>>>> >
>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>> >
>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as one of the legacy behaviors via this configuration, under the reused ID SPARK-30098.
>>>>>>> >
>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>> >
>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>> >
>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>> >
>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>> >
>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>> >
>>>>>>> > Dongjoon.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
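As a footnote to the proposal above, a small, illustrative sketch of what the default switch means for a plain `CREATE TABLE` (invented table names, a local Hive-enabled session assumed; the expected `Provider` values are noted in comments rather than captured output):

```scala
import org.apache.spark.sql.SparkSession

object CreateTableDefaultDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("create-table-default")
      .enableHiveSupport()
      .getOrCreate()

    // Legacy behavior (`true`, the pre-4.0.0 default): CREATE TABLE without
    // USING / STORED AS is mapped to a Hive table.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
    spark.sql("CREATE TABLE t_legacy (id INT, name STRING)")
    spark.sql("DESCRIBE TABLE EXTENDED t_legacy")
      .filter("col_name = 'Provider'")
      .show(false)                       // expected provider: hive

    // Proposed default (`false`): the same statement picks up
    // spark.sql.sources.default (parquet unless overridden) as the provider.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
    spark.sql("CREATE TABLE t_native (id INT, name STRING)")
    spark.sql("DESCRIBE TABLE EXTENDED t_native")
      .filter("col_name = 'Provider'")
      .show(false)                       // expected provider: parquet

    // Nothing changes when the provider is spelled out explicitly, and the
    // legacy behavior stays available by setting the flag back to true.
    spark.sql("CREATE TABLE t_explicit_hive (id INT) STORED AS ORC")
    spark.sql("CREATE TABLE t_explicit_ds (id INT) USING parquet")

    spark.stop()
  }
}
```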