+1

On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
> Of course, I can't think of a scenario with thousands of tables on a
> single in-memory Spark cluster with an in-memory catalog.
> Thanks for the help!
>
> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Agreed. In scenarios where most of the interactions with the catalog are
>> related to query planning, saving, and metadata management, the choice of
>> catalog implementation may have less impact on query runtime performance,
>> because the time spent on metadata operations is generally minimal
>> compared to the time spent on actual data fetching, processing, and
>> computation.
>>
>> However, we should consider scalability and reliability, especially as
>> the size and complexity of the data and the query workload grow. While an
>> in-memory catalog may offer excellent performance for smaller workloads,
>> it will face limitations in handling larger-scale deployments with
>> thousands of tables, partitions, and users. Additionally, durability and
>> persistence are crucial considerations, particularly in production
>> environments where data integrity and availability matter. In-memory
>> catalog implementations may lack durability, meaning that metadata
>> changes could be lost in the event of a system failure or restart.
>> Therefore, while in-memory catalog implementations can provide speed and
>> efficiency for certain use cases, we ought to consider the requirements
>> for scalability, reliability, and data durability when choosing a catalog
>> solution for production deployments. In many cases, a combination of
>> in-memory and disk-based catalog solutions may offer the best balance of
>> performance and resilience for demanding large-scale workloads.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand expert
>> opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Of course, but it's in memory and not persisted, which is much faster,
>>> and as I said, I believe that most of the interaction with it happens
>>> during planning and save rather than during the actual query run; those
>>> interactions are short and minimal compared to data fetching and
>>> manipulation, so I don't believe it will have a big impact on the query
>>> run...
>>>
>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, I would be surprised, because the Derby database is
>>>> single-threaded and won't be of much use here.
>>>>
>>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle
>>>> as the backing database; these are battle-proven, replicated, and
>>>> backed up.
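>>>>
>>>> For illustration, a minimal sketch of pointing a Spark session at an
>>>> external PostgreSQL-backed metastore instead of the embedded Derby one.
>>>> These JDO properties normally live in hive-site.xml; the host, database
>>>> name, and credentials below are hypothetical placeholders, not a tested
>>>> setup.
>>>>
>>>>   import org.apache.spark.sql.SparkSession
>>>>
>>>>   // Back the Hive metastore with PostgreSQL rather than the default
>>>>   // embedded Derby database (placeholder values throughout).
>>>>   val spark = SparkSession.builder()
>>>>     .appName("external-metastore-sketch")
>>>>     .enableHiveSupport()
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionURL",
>>>>       "jdbc:postgresql://metastore-host:5432/hive_metastore")
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
>>>>       "org.postgresql.Driver")
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "***")
>>>>     .getOrCreate()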
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>> London
>>>> United Kingdom
>>>>
>>>> view my LinkedIn profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>> that, as with any advice, "one test result is worth one-thousand expert
>>>> opinions" (Wernher von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>
>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>> And again, I presume that most metadata-related work happens during
>>>>> planning and not during the actual run, so I don't see why it should
>>>>> strongly affect query performance.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> With regard to your point below:
>>>>>>
>>>>>> "The thing I'm missing is this: let's say that the output format I
>>>>>> choose is Delta Lake or Iceberg or whatever format that uses parquet.
>>>>>> Where does the catalog implementation (which holds metadata, AFAIK the
>>>>>> same metadata that Iceberg and Delta Lake save for their tables about
>>>>>> their columns) come into play, and why should it affect performance?"
>>>>>>
>>>>>> The catalog implementation comes into play regardless of the output
>>>>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>>>>> responsible for managing metadata about the datasets, tables, schemas,
>>>>>> and other objects stored in those formats. Even though Delta Lake and
>>>>>> Iceberg have their own internal metadata management mechanisms, they
>>>>>> still rely on the catalog to provide a unified interface for accessing
>>>>>> and manipulating metadata across different storage formats.
>>>>>>
>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>> totally wrong here, the internal Spark catalog is a local installation
>>>>>> of the Hive metastore anyway, so I'm not sure what the catalog has to
>>>>>> do with anything."
>>>>>>
>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> view my LinkedIn profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>>> that, as with any advice, "one test result is worth one-thousand
>>>>>> expert opinions" (Wernher von Braun
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the detailed answer.
>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>> choose is Delta Lake or Iceberg or whatever format that uses parquet.
>>>>>>> Where does the catalog implementation (which holds metadata, AFAIK
>>>>>>> the same metadata that Iceberg and Delta Lake save for their tables
>>>>>>> about their columns) come into play, and why should it affect
>>>>>>> performance?
>>>>>>> Another thing is that if I understand correctly, and I might be
>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>> installation of the Hive metastore anyway, so I'm not sure what the
>>>>>>> catalog has to do with anything.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My take regarding your question is that your mileage may vary, so
>>>>>>>> to speak.
>>>>>>>>
>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution
>>>>>>>> that integrates well with other components in the Hadoop ecosystem,
>>>>>>>> such as HDFS, HBase, and YARN. If you are Hadoop-centric (say,
>>>>>>>> on-premise), using Hive may offer better compatibility and
>>>>>>>> interoperability.
>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users who
>>>>>>>> are accustomed to traditional RDBMSs. If your use case involves
>>>>>>>> complex SQL queries or existing SQL-based workflows, using Hive may
>>>>>>>> be advantageous.
>>>>>>>> 3) If you are looking for performance, Spark's native catalog tends
>>>>>>>> to offer better performance for certain workloads, particularly
>>>>>>>> those that involve iterative processing or complex data
>>>>>>>> transformations (my understanding). Spark's in-memory processing
>>>>>>>> capabilities and optimizations make it well suited for interactive
>>>>>>>> analytics and machine learning tasks (my favourite).
>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for
>>>>>>>> data processing and analytics, using Spark's native catalog may
>>>>>>>> simplify workflow management and reduce overhead. Spark's tight
>>>>>>>> integration with its catalog allows for seamless interaction with
>>>>>>>> Spark applications and libraries.
>>>>>>>> 5) There seems to be some similarity between the Spark catalog and
>>>>>>>> the Databricks Unity Catalog, so that may favour the choice.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>> London
>>>>>>>> United Kingdom
>>>>>>>>
>>>>>>>> view my LinkedIn profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>>>> knowledge but of course cannot be guaranteed. It is essential to
>>>>>>>> note that, as with any advice, "one test result is worth
>>>>>>>> one-thousand expert opinions" (Wernher von Braun
>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I will also appreciate some material that describes the
>>>>>>>>> differences between Spark native tables and Hive tables and why
>>>>>>>>> each should be used...
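>>>>>>>>>
>>>>>>>>> To make the question concrete, an illustrative sketch of the two
>>>>>>>>> kinds of statements I mean (table names are made up, and the first
>>>>>>>>> one assumes Hive support is enabled):
>>>>>>>>>
>>>>>>>>>   // A Hive-format table vs. a Spark native (data source) table.
>>>>>>>>>   spark.sql("CREATE TABLE hive_style (id INT) STORED AS PARQUET")
>>>>>>>>>   spark.sql("CREATE TABLE native_style (id INT) USING parquet")
>>>>>>>>>   // And the ambiguous case discussed in this thread: neither
>>>>>>>>>   // clause, so the provider is decided by configuration.
>>>>>>>>>   spark.sql("CREATE TABLE default_style (id INT)")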
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nimrod
>>>>>>>>>
>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I see a statement made below, and I quote:
>>>>>>>>>>
>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>> this configuration from `true` to `false` to use Spark native
>>>>>>>>>> tables because we support better."
>>>>>>>>>>
>>>>>>>>>> Can you please elaborate on the above, specifically with regard
>>>>>>>>>> to the phrase ".. because we support better"?
>>>>>>>>>>
>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>> believe it is internal) or to integration with Spark?
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI |
>>>>>>>>>> FinCrime
>>>>>>>>>> London
>>>>>>>>>> United Kingdom
>>>>>>>>>>
>>>>>>>>>> view my LinkedIn profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* The information provided is correct to the best of
>>>>>>>>>> my knowledge but of course cannot be guaranteed. It is essential
>>>>>>>>>> to note that, as with any advice, "one test result is worth
>>>>>>>>>> one-thousand expert opinions" (Wernher von Braun
>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun
>>>>>>>>>>>> <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>> >
>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more
>>>>>>>>>>>> > and more. Thank you all.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from
>>>>>>>>>>>> > the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>>>>>> >
>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false`
>>>>>>>>>>>> >   by default
>>>>>>>>>>>> >
>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>> > mapped to a `Hive` table. The proposal of SPARK-46122 is to
>>>>>>>>>>>> > switch the default value of this configuration from `true` to
>>>>>>>>>>>> > `false` to use Spark native tables because we support better.
>>>>>>>>>>>> >
>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>> > `spark.sql.sources.default` as the table provider instead of
>>>>>>>>>>>> > `Hive`, like the other Spark APIs. Of course, users can get
>>>>>>>>>>>> > all the legacy behavior by setting it back to `true`.
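>>>>>>>>>>>> >
>>>>>>>>>>>> > For illustration, a quick sketch of the behavior difference
>>>>>>>>>>>> > (the table name here is made up):
>>>>>>>>>>>> >
>>>>>>>>>>>> >   // With the proposed default, a CREATE TABLE without a
>>>>>>>>>>>> >   // USING or STORED AS clause falls back to the value of
>>>>>>>>>>>> >   // spark.sql.sources.default (parquet), i.e. a native table.
>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
>>>>>>>>>>>> >   spark.sql("CREATE TABLE t (id INT)")
>>>>>>>>>>>> >   // DESCRIBE now reports "Provider: parquet" rather than
>>>>>>>>>>>> >   // "Provider: hive".
>>>>>>>>>>>> >   spark.sql("DESCRIBE TABLE EXTENDED t").show(100, truncate = false)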
>>>>>>>>>>>> >
>>>>>>>>>>>> > Historically, this behavior change was already merged once
>>>>>>>>>>>> > during the Apache Spark 3.0.0 preparation via SPARK-30098, but
>>>>>>>>>>>> > it was reverted during the 3.0.0 RC period.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>> > for CREATE TABLE
>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>> >
>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this
>>>>>>>>>>>> > and defined it as a legacy behavior behind this configuration,
>>>>>>>>>>>> > via the reused ID SPARK-30098.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2020-12-01:
>>>>>>>>>>>> > https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>> >
>>>>>>>>>>>> > Last year, we received two additional requests to switch this,
>>>>>>>>>>>> > because Apache Spark 4.0.0 is a good time to make a decision
>>>>>>>>>>>> > for the future direction.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0
>>>>>>>>>>>> > idea.
>>>>>>>>>>>> >
>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>> > which is one line of main code, one line of migration guide,
>>>>>>>>>>>> > and a few lines of test code:
>>>>>>>>>>>> >
>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>> >
>>>>>>>>>>>> > Dongjoon.