+1

On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
> Of course, I can't think of a scenario with thousands of tables on a
> single in-memory Spark cluster with an in-memory catalog.
> Thanks for the help!
>
> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Agreed. In scenarios where most of the interactions with the catalog are
>> related to query planning, saving, and metadata management, the choice of
>> catalog implementation may have less impact on query runtime performance,
>> because the time spent on metadata operations is generally minimal
>> compared to the time spent on actual data fetching, processing, and
>> computation.
>>
>> However, we should consider scalability and reliability, especially as
>> the size and complexity of the data and the query workload grow. While an
>> in-memory catalog may offer excellent performance for smaller workloads,
>> it will face limitations in handling larger-scale deployments with
>> thousands of tables, partitions, and users. Additionally, durability and
>> persistence are crucial considerations, particularly in production
>> environments where data integrity and availability matter. In-memory
>> catalog implementations may lack durability, meaning that metadata
>> changes could be lost in the event of a system failure or restart.
>> Therefore, while in-memory catalog implementations can provide speed and
>> efficiency for certain use cases, we ought to consider the requirements
>> for scalability, reliability, and data durability when choosing a catalog
>> solution for production deployments. In many cases, a combination of
>> in-memory and disk-based catalog solutions may offer the best balance of
>> performance and resilience for demanding large-scale workloads.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>> view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand expert
>> opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Of course, but it's in memory and not persisted, which is much faster,
>>> and as I said, I believe that most of the interaction with it happens
>>> during planning and save rather than during the actual query run; those
>>> interactions are short and minimal compared to data fetching and
>>> manipulation, so I don't believe it will have a big impact on the query
>>> run...
>>>
>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, I would be surprised, because the Derby database is
>>>> single-threaded and won't be of much use here.
>>>>
>>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle
>>>> as the backing database; these are battle-proven, replicated, and
>>>> backed up.
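>>>>
>>>> For illustration, a minimal sketch of pointing a Spark session at an
>>>> external PostgreSQL-backed metastore instead of the embedded Derby one.
>>>> These JDO properties normally live in hive-site.xml; the host, database
>>>> name, and credentials below are hypothetical placeholders, not a tested
>>>> setup.
>>>>
>>>>   import org.apache.spark.sql.SparkSession
>>>>
>>>>   // Back the Hive metastore with PostgreSQL rather than the default
>>>>   // embedded Derby database (placeholder values throughout).
>>>>   val spark = SparkSession.builder()
>>>>     .appName("external-metastore-sketch")
>>>>     .enableHiveSupport()
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionURL",
>>>>       "jdbc:postgresql://metastore-host:5432/hive_metastore")
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
>>>>       "org.postgresql.Driver")
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
>>>>     .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "***")
>>>>     .getOrCreate()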
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>> London
>>>> United Kingdom
>>>>
>>>> view my LinkedIn profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>> that, as with any advice, "one test result is worth one-thousand expert
>>>> opinions" (Wernher von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>
>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>> And again, I presume that most metadata-related work happens during
>>>>> planning and not during the actual run, so I don't see why it should
>>>>> strongly affect query performance.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> With regard to your point below:
>>>>>>
>>>>>> "The thing I'm missing is this: let's say that the output format I
>>>>>> choose is Delta Lake or Iceberg or whatever format that uses parquet.
>>>>>> Where does the catalog implementation (which holds metadata, AFAIK the
>>>>>> same metadata that Iceberg and Delta Lake save for their tables about
>>>>>> their columns) come into play, and why should it affect performance?"
>>>>>>
>>>>>> The catalog implementation comes into play regardless of the output
>>>>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>>>>> responsible for managing metadata about the datasets, tables, schemas,
>>>>>> and other objects stored in those formats. Even though Delta Lake and
>>>>>> Iceberg have their own internal metadata management mechanisms, they
>>>>>> still rely on the catalog to provide a unified interface for accessing
>>>>>> and manipulating metadata across different storage formats.
>>>>>>
>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>> totally wrong here, the internal Spark catalog is a local installation
>>>>>> of the Hive metastore anyway, so I'm not sure what the catalog has to
>>>>>> do with anything."
>>>>>>
>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> view my LinkedIn profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>>> that, as with any advice, "one test result is worth one-thousand
>>>>>> expert opinions" (Wernher von Braun
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the detailed answer.
>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>> choose is Delta Lake or Iceberg or whatever format that uses parquet.
>>>>>>> Where does the catalog implementation (which holds metadata, AFAIK
>>>>>>> the same metadata that Iceberg and Delta Lake save for their tables
>>>>>>> about their columns) come into play, and why should it affect
>>>>>>> performance?
>>>>>>> Another thing is that if I understand correctly, and I might be
>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>> installation of the Hive metastore anyway, so I'm not sure what the
>>>>>>> catalog has to do with anything.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My take regarding your question is that your mileage may vary, so
>>>>>>>> to speak.
>>>>>>>>
>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution
>>>>>>>> that integrates well with other components in the Hadoop ecosystem,
>>>>>>>> such as HDFS, HBase, and YARN. If you are Hadoop-centric (say,
>>>>>>>> on-premise), using Hive may offer better compatibility and
>>>>>>>> interoperability.
>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users who
>>>>>>>> are accustomed to traditional RDBMSs. If your use case involves
>>>>>>>> complex SQL queries or existing SQL-based workflows, using Hive may
>>>>>>>> be advantageous.
>>>>>>>> 3) If you are looking for performance, Spark's native catalog tends
>>>>>>>> to offer better performance for certain workloads, particularly
>>>>>>>> those that involve iterative processing or complex data
>>>>>>>> transformations (my understanding). Spark's in-memory processing
>>>>>>>> capabilities and optimizations make it well suited for interactive
>>>>>>>> analytics and machine learning tasks (my favourite).
>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for
>>>>>>>> data processing and analytics, using Spark's native catalog may
>>>>>>>> simplify workflow management and reduce overhead. Spark's tight
>>>>>>>> integration with its catalog allows for seamless interaction with
>>>>>>>> Spark applications and libraries.
>>>>>>>> 5) There seems to be some similarity between the Spark catalog and
>>>>>>>> the Databricks Unity Catalog, so that may favour the choice.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>> London
>>>>>>>> United Kingdom
>>>>>>>>
>>>>>>>> view my LinkedIn profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>>>> knowledge but of course cannot be guaranteed. It is essential to
>>>>>>>> note that, as with any advice, "one test result is worth
>>>>>>>> one-thousand expert opinions" (Wernher von Braun
>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I will also appreciate some material that describes the
>>>>>>>>> differences between Spark native tables and Hive tables and why
>>>>>>>>> each should be used...
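>>>>>>>>>
>>>>>>>>> To make the question concrete, an illustrative sketch of the two
>>>>>>>>> kinds of statements I mean (table names are made up, and the first
>>>>>>>>> one assumes Hive support is enabled):
>>>>>>>>>
>>>>>>>>>   // A Hive-format table vs. a Spark native (data source) table.
>>>>>>>>>   spark.sql("CREATE TABLE hive_style (id INT) STORED AS PARQUET")
>>>>>>>>>   spark.sql("CREATE TABLE native_style (id INT) USING parquet")
>>>>>>>>>   // And the ambiguous case discussed in this thread: neither
>>>>>>>>>   // clause, so the provider is decided by configuration.
>>>>>>>>>   spark.sql("CREATE TABLE default_style (id INT)")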
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nimrod
>>>>>>>>>
>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I see a statement made below, and I quote:
>>>>>>>>>>
>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>> this configuration from `true` to `false` to use Spark native
>>>>>>>>>> tables because we support better."
>>>>>>>>>>
>>>>>>>>>> Can you please elaborate on the above, specifically with regard
>>>>>>>>>> to the phrase ".. because we support better"?
>>>>>>>>>>
>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>> believe it is internal) or to integration with Spark?
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI |
>>>>>>>>>> FinCrime
>>>>>>>>>> London
>>>>>>>>>> United Kingdom
>>>>>>>>>>
>>>>>>>>>> view my LinkedIn profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* The information provided is correct to the best of
>>>>>>>>>> my knowledge but of course cannot be guaranteed. It is essential
>>>>>>>>>> to note that, as with any advice, "one test result is worth
>>>>>>>>>> one-thousand expert opinions" (Wernher von Braun
>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun
>>>>>>>>>>>> <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>> >
>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more
>>>>>>>>>>>> > and more. Thank you all.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from
>>>>>>>>>>>> > the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>>>>>> >
>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false`
>>>>>>>>>>>> >   by default
>>>>>>>>>>>> >
>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>> > mapped to a `Hive` table. The proposal of SPARK-46122 is to
>>>>>>>>>>>> > switch the default value of this configuration from `true` to
>>>>>>>>>>>> > `false` to use Spark native tables because we support better.
>>>>>>>>>>>> >
>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>> > `spark.sql.sources.default` as the table provider instead of
>>>>>>>>>>>> > `Hive`, like the other Spark APIs. Of course, users can get
>>>>>>>>>>>> > all the legacy behavior by setting it back to `true`.
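>>>>>>>>>>>> >
>>>>>>>>>>>> > For illustration, a quick sketch of the behavior difference
>>>>>>>>>>>> > (the table name here is made up):
>>>>>>>>>>>> >
>>>>>>>>>>>> >   // With the proposed default, a CREATE TABLE without a
>>>>>>>>>>>> >   // USING or STORED AS clause falls back to the value of
>>>>>>>>>>>> >   // spark.sql.sources.default (parquet), i.e. a native table.
>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
>>>>>>>>>>>> >   spark.sql("CREATE TABLE t (id INT)")
>>>>>>>>>>>> >   // DESCRIBE now reports "Provider: parquet" rather than
>>>>>>>>>>>> >   // "Provider: hive".
>>>>>>>>>>>> >   spark.sql("DESCRIBE TABLE EXTENDED t").show(100, truncate = false)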
>>>>>>>>>>>> >
>>>>>>>>>>>> > Historically, this behavior change was already merged once
>>>>>>>>>>>> > during the Apache Spark 3.0.0 preparation via SPARK-30098, but
>>>>>>>>>>>> > it was reverted during the 3.0.0 RC period.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>> > for CREATE TABLE
>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>> >
>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this
>>>>>>>>>>>> > and defined it as a legacy behavior behind this configuration,
>>>>>>>>>>>> > via the reused ID SPARK-30098.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2020-12-01:
>>>>>>>>>>>> > https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>> >
>>>>>>>>>>>> > Last year, we received two additional requests to switch this,
>>>>>>>>>>>> > because Apache Spark 4.0.0 is a good time to make a decision
>>>>>>>>>>>> > for the future direction.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0
>>>>>>>>>>>> > idea.
>>>>>>>>>>>> >
>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>> > which is one line of main code, one line of migration guide,
>>>>>>>>>>>> > and a few lines of test code:
>>>>>>>>>>>> >
>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>> >
>>>>>>>>>>>> > Dongjoon.