Of course. I can't think of a scenario with thousands of tables on a single in-memory Spark cluster using the in-memory catalog. Thanks for the help!
On Thu, 25 Apr 2024, 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Agreed. In scenarios where most of the interactions with the catalog relate to query planning, saving and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal compared to the time spent on actual data fetching, processing, and computation.
>
> However, scalability and reliability become real concerns as the size and complexity of the data and query workload grow. While an in-memory catalog may offer excellent performance for smaller workloads, it will face limitations in handling larger-scale deployments with thousands of tables, partitions, and users. Durability and persistence are also crucial considerations, particularly in production environments where data integrity and availability matter. In-memory catalog implementations may lack durability, meaning that metadata changes could be lost in the event of a system failure or restart. Therefore, while in-memory catalog implementations can provide speed and efficiency for certain use cases, we ought to weigh the requirements for scalability, reliability, and data durability when choosing a catalog solution for production deployments. In many cases, a combination of in-memory and disk-based catalog solutions may offer the best balance of performance and resilience for demanding large-scale workloads.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London, United Kingdom
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Of course, but it is in memory and not persisted, which is much faster. As I said, I believe most of the interaction with it happens during planning and saving rather than during the actual query run, and those operations are short and minimal compared to data fetching and manipulation, so I don't believe it will have a big impact on query runtime...
>>
>> On Thu, 25 Apr 2024, 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Well, I would be surprised, because the Derby database is single-threaded and won't be of much use here.
>>>
>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle for the metastore; these are battle proven, replicated and backed up.
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
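As background for the Derby vs. PostgreSQL/Oracle point above, here is a minimal sketch of pointing a Spark session at a PostgreSQL-backed Hive metastore instead of the default embedded Derby one. The host, database name and credentials are hypothetical, and it assumes the PostgreSQL JDBC driver is on the classpath and the metastore schema has already been initialised.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical host and credentials; assumes the PostgreSQL JDBC driver is on the
// classpath and that the metastore schema has already been created.
val spark = SparkSession.builder()
  .appName("postgres-backed-metastore")
  .master("local[*]")
  .enableHiveSupport() // use the Hive catalog rather than the in-memory one
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:postgresql://metastore-db.example.com:5432/hive_metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "changeme")
  .getOrCreate()

// Metadata operations now go to the shared, persistent metastore.
spark.sql("SHOW DATABASES").show(truncate = false)
```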
>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>> And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance.
>>>>
>>>> Thanks,
>>>>
>>>> On Thu, 25 Apr 2024, 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> With regard to your point below:
>>>>>
>>>>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata, AFAIK the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>>>>
>>>>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas, and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>>>>
>>>>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>>>>
>>>>> I don't understand this. Do you mean a Derby database?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>
>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the detailed answer.
>>>>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata, AFAIK the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>>>>>> Another thing is that, if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>>>>
>>>>>> Thanks!
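As a small illustration of the point being discussed: the catalog API Spark exposes is the same whether the session uses the default in-memory catalog or the Hive catalog (backed by embedded Derby unless configured otherwise); only the backing metadata store differs. The local master and app name below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalog-inspection")
  // .enableHiveSupport() // uncomment to use the Hive catalog (embedded Derby by default)
  .getOrCreate()

// Reports "in-memory" when Hive support is not enabled, "hive" when it is.
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))

// The same catalog API is available in both cases; only the metadata store differs.
spark.catalog.listDatabases().show(truncate = false)
spark.catalog.listTables("default").show(truncate = false)
```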
>>>>>> On Thu, 25 Apr 2024, 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>>>>
>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say on-premise), using Hive may offer better compatibility and interoperability.
>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>>>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nimrod
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024, 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>
>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>>>>
>>>>>>>>> Can you please elaborate on the above, specifically with regard to the phrase "... because we support better"?
>>>>>>>>>
>>>>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to its integration with Spark?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh,
>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
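To make the native-vs-Hive table distinction asked about above a bit more concrete, here is a short sketch, assuming an existing `spark` session with Hive support enabled; the table names are made up. `DESCRIBE EXTENDED` shows the provider and, for the Hive table, the SerDe details.

```scala
// One Spark datasource ("native") table and one Hive SerDe table with the same schema.
spark.sql("CREATE TABLE native_t (id INT, name STRING) USING parquet")
spark.sql("CREATE TABLE hive_t (id INT, name STRING) STORED AS PARQUET")

// The detailed table information differs: native_t reports Provider = parquet,
// while hive_t reports Provider = hive along with SerDe / InputFormat / OutputFormat.
spark.sql("DESCRIBE EXTENDED native_t").show(100, truncate = false)
spark.sql("DESCRIBE EXTENDED hive_t").show(100, truncate = false)
```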
>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kent Yao
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi, All.
>>>>>>>>>>> >
>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>>>>> >
>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>>>>> >
>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>>>> >
>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>>>>> >
>>>>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
>>>>>>>>>>> >
>>>>>>>>>>> > Historically, this behavior change was merged once already during Apache Spark 3.0.0 preparation via SPARK-30098, but it was reverted during the 3.0.0 RC period.
>>>>>>>>>>> >
>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>>>>> >
>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as one of the legacy behaviors behind this configuration, reusing the ID SPARK-30098.
>>>>>>>>>>> >
>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>>>>> >
>>>>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>>>>>> >
>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea
>>>>>>>>>>> >
>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>>>>> >
>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>> >
>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
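For anyone who wants to see the proposed default in action, a minimal sketch follows, assuming an existing `spark` session with Hive support enabled and assuming the legacy flag can be set per session; the table names are made up.

```scala
// With the proposed default, spark.sql.legacy.createHiveTableByDefault = false,
// a bare CREATE TABLE picks its provider from spark.sql.sources.default (parquet
// unless overridden) instead of creating a Hive table.
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
spark.sql("CREATE TABLE t_bare (id INT)")                    // provider taken from spark.sql.sources.default
spark.sql("CREATE TABLE t_orc (id INT) USING orc")           // explicit provider, unaffected by the flag
spark.sql("CREATE TABLE t_hive (id INT) STORED AS PARQUET")  // explicit Hive syntax, unaffected by the flag
spark.sql("DESCRIBE EXTENDED t_bare").show(100, truncate = false)

// Setting the flag back to true restores the legacy mapping of a bare CREATE TABLE
// to a Hive table.
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
```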