Of course. I can't think of a scenario with thousands of tables on a single in-memory Spark cluster using the in-memory catalog. Thanks for the help!
On Thu, 25 Apr 2024, 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Agreed. In scenarios where most of the interactions with the catalog relate to query planning, saving and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal compared to the time spent on actual data fetching, processing, and computation.
>
> However, scalability and reliability become real concerns as the size and complexity of the data and query workload grow. While an in-memory catalog may offer excellent performance for smaller workloads, it will face limitations in handling larger-scale deployments with thousands of tables, partitions, and users. Durability and persistence are also crucial considerations, particularly in production environments where data integrity and availability matter. In-memory catalog implementations may lack durability, meaning that metadata changes could be lost in the event of a system failure or restart. Therefore, while in-memory catalog implementations can provide speed and efficiency for certain use cases, we ought to weigh the requirements for scalability, reliability, and data durability when choosing a catalog solution for production deployments. In many cases, a combination of in-memory and disk-based catalog solutions may offer the best balance of performance and resilience for demanding large-scale workloads.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London, United Kingdom
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Of course, but it is in memory and not persisted, which is much faster. As I said, I believe most of the interaction with it happens during planning and saving rather than during the actual query run, and those operations are short and minimal compared to data fetching and manipulation, so I don't believe it will have a big impact on query runtime...
>>
>> On Thu, 25 Apr 2024, 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Well, I would be surprised, because the Derby database is single-threaded and won't be of much use here.
>>>
>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle for the metastore; these are battle proven, replicated and backed up.
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
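As background for the Derby vs. PostgreSQL/Oracle point above, here is a minimal sketch of pointing a Spark session at a PostgreSQL-backed Hive metastore instead of the default embedded Derby one. The host, database name and credentials are hypothetical, and it assumes the PostgreSQL JDBC driver is on the classpath and the metastore schema has already been initialised.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical host and credentials; assumes the PostgreSQL JDBC driver is on the
// classpath and that the metastore schema has already been created.
val spark = SparkSession.builder()
  .appName("postgres-backed-metastore")
  .master("local[*]")
  .enableHiveSupport() // use the Hive catalog rather than the in-memory one
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:postgresql://metastore-db.example.com:5432/hive_metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "changeme")
  .getOrCreate()

// Metadata operations now go to the shared, persistent metastore.
spark.sql("SHOW DATABASES").show(truncate = false)
```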
>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>> And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance.
>>>>
>>>> Thanks,
>>>>
>>>> On Thu, 25 Apr 2024, 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> With regard to your point below:
>>>>>
>>>>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata, AFAIK the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>>>>
>>>>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas, and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>>>>
>>>>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>>>>
>>>>> I don't understand this. Do you mean a Derby database?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>
>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the detailed answer.
>>>>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata, AFAIK the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>>>>>> Another thing is that, if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>>>>
>>>>>> Thanks!
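As a small illustration of the point being discussed: the catalog API Spark exposes is the same whether the session uses the default in-memory catalog or the Hive catalog (backed by embedded Derby unless configured otherwise); only the backing metadata store differs. The local master and app name below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalog-inspection")
  // .enableHiveSupport() // uncomment to use the Hive catalog (embedded Derby by default)
  .getOrCreate()

// Reports "in-memory" when Hive support is not enabled, "hive" when it is.
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))

// The same catalog API is available in both cases; only the metadata store differs.
spark.catalog.listDatabases().show(truncate = false)
spark.catalog.listTables("default").show(truncate = false)
```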
>>>>>> On Thu, 25 Apr 2024, 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>>>>
>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say on-premise), using Hive may offer better compatibility and interoperability.
>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>>>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nimrod
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024, 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>
>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>>>>
>>>>>>>>> Can you please elaborate on the above, specifically with regard to the phrase "... because we support better"?
>>>>>>>>>
>>>>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to its integration with Spark?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh,
>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
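To make the native-vs-Hive table distinction asked about above a bit more concrete, here is a short sketch, assuming an existing `spark` session with Hive support enabled; the table names are made up. `DESCRIBE EXTENDED` shows the provider and, for the Hive table, the SerDe details.

```scala
// One Spark datasource ("native") table and one Hive SerDe table with the same schema.
spark.sql("CREATE TABLE native_t (id INT, name STRING) USING parquet")
spark.sql("CREATE TABLE hive_t (id INT, name STRING) STORED AS PARQUET")

// The detailed table information differs: native_t reports Provider = parquet,
// while hive_t reports Provider = hive along with SerDe / InputFormat / OutputFormat.
spark.sql("DESCRIBE EXTENDED native_t").show(100, truncate = false)
spark.sql("DESCRIBE EXTENDED hive_t").show(100, truncate = false)
```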
>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kent Yao
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi, All.
>>>>>>>>>>> >
>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>>>>> >
>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>>>>> >
>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>>>> >
>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>>>>> >
>>>>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
>>>>>>>>>>> >
>>>>>>>>>>> > Historically, this behavior change was merged once already during Apache Spark 3.0.0 preparation via SPARK-30098, but it was reverted during the 3.0.0 RC period.
>>>>>>>>>>> >
>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>>>>> >
>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as one of the legacy behaviors behind this configuration, reusing the ID SPARK-30098.
>>>>>>>>>>> >
>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>>>>> >
>>>>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>>>>>> >
>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea
>>>>>>>>>>> >
>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>>>>> >
>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>> >
>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
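For anyone who wants to see the proposed default in action, a minimal sketch follows, assuming an existing `spark` session with Hive support enabled and assuming the legacy flag can be set per session; the table names are made up.

```scala
// With the proposed default, spark.sql.legacy.createHiveTableByDefault = false,
// a bare CREATE TABLE picks its provider from spark.sql.sources.default (parquet
// unless overridden) instead of creating a Hive table.
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
spark.sql("CREATE TABLE t_bare (id INT)")                    // provider taken from spark.sql.sources.default
spark.sql("CREATE TABLE t_orc (id INT) USING orc")           // explicit provider, unaffected by the flag
spark.sql("CREATE TABLE t_hive (id INT) STORED AS PARQUET")  // explicit Hive syntax, unaffected by the flag
spark.sql("DESCRIBE EXTENDED t_bare").show(100, truncate = false)

// Setting the flag back to true restores the legacy mapping of a bare CREATE TABLE
// to a Hive table.
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
```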