Yes, an in-memory Hive catalog backed by a local Derby DB. And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance.
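A quick way to see which catalog implementation a session is actually running with is to read the relevant static conf and poke the catalog directly. A minimal, illustrative sketch (assuming a local build with the spark-hive module on the classpath and no external metastore configured, in which case Spark falls back to the embedded Derby-backed metastore mentioned above):

```scala
import org.apache.spark.sql.SparkSession

object CatalogCheck {
  def main(args: Array[String]): Unit = {
    // With enableHiveSupport() and no hive-site.xml, Spark uses an embedded,
    // Derby-backed metastore (the local metastore_db directory).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalog-check")
      .enableHiveSupport()
      .getOrCreate()

    // Static conf fixed at session start: expected "hive" here,
    // "in-memory" for a session without Hive support.
    println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))

    // Directory used for managed tables.
    println(spark.conf.get("spark.sql.warehouse.dir"))

    // Pure metadata operations: these are answered by the catalog/metastore
    // during analysis and planning, before any table data is read.
    spark.catalog.listDatabases().show(truncate = false)
    spark.sql("SHOW TABLES").show()

    spark.stop()
  }
}
```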
Thanks,

On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> With regard to your point below:
>
> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>
> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas, and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>
> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>
> I don't understand this. Do you mean a Derby database?
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
> view my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Thanks for the detailed answer.
>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>> Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>
>> Thanks!
>>
>>
>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> My take regarding your question is that your mileage varies, so to speak.
>>>
>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premises), using Hive may offer better compatibility and interoperability.
>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>> 5) There seems to be some similarity between the Spark catalog and Databricks Unity Catalog, so that may favour the choice.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>> view my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>>
>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables, and why each should be used...
>>>>
>>>> Thanks
>>>> Nimrod
>>>>
>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I see a statement made as below, and I quote:
>>>>>
>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>
>>>>> Can you please elaborate on the above, specifically with regard to the phrase "... because we support better"?
>>>>>
>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to its integration with Spark?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>
>>>>>
>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kent Yao
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>> >
>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>> >
>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>> >
>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>> >
>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
>>>>>>> >
>>>>>>> > Historically, this behavior change was merged once during Apache Spark 3.0.0 preparation via SPARK-30098, but was reverted during the 3.0.0 RC period.
>>>>>>> >
>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>> >
>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as one of the legacy behaviors via this configuration, under the reused ID SPARK-30098.
>>>>>>> >
>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>> >
>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>> >
>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>> >
>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>> >
>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>> >
>>>>>>> > Dongjoon.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
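As a footnote to the proposal above, a small, illustrative sketch of what the default switch means for a plain `CREATE TABLE` (invented table names, a local Hive-enabled session assumed; the expected `Provider` values are noted in comments rather than captured output):

```scala
import org.apache.spark.sql.SparkSession

object CreateTableDefaultDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("create-table-default")
      .enableHiveSupport()
      .getOrCreate()

    // Legacy behavior (`true`, the pre-4.0.0 default): CREATE TABLE without
    // USING / STORED AS is mapped to a Hive table.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
    spark.sql("CREATE TABLE t_legacy (id INT, name STRING)")
    spark.sql("DESCRIBE TABLE EXTENDED t_legacy")
      .filter("col_name = 'Provider'")
      .show(false)                       // expected provider: hive

    // Proposed default (`false`): the same statement picks up
    // spark.sql.sources.default (parquet unless overridden) as the provider.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
    spark.sql("CREATE TABLE t_native (id INT, name STRING)")
    spark.sql("DESCRIBE TABLE EXTENDED t_native")
      .filter("col_name = 'Provider'")
      .show(false)                       // expected provider: parquet

    // Nothing changes when the provider is spelled out explicitly, and the
    // legacy behavior stays available by setting the flag back to true.
    spark.sql("CREATE TABLE t_explicit_hive (id INT) STORED AS ORC")
    spark.sql("CREATE TABLE t_explicit_ds (id INT) USING parquet")

    spark.stop()
  }
}
```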