Mich, it is a legacy config that we should get rid of in the end, and it has been tested in production for a very long time. Spark should create a Spark table by default.
On Tue, Apr 30, 2024 at 5:38 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Your point:
>
> "... it's a surprise to me to see that someone has different positions in a very short period of time in the community ..."
>
> Well, I have been with Spark since 2015, and this is my article dated February 7, 2016 on both Hive and Spark, which I also presented at a Hortonworks meet-up:
>
> Hive on Spark Engine Versus Spark Using Hive Metastore <https://www.linkedin.com/pulse/hive-spark-engine-versus-using-metastore-mich-talebzadeh-ph-d-/>
>
> With regard to why I cast a +1 vote for one and a -1 for the other, I think it is my prerogative how I vote, and we should leave it at that.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London
> United Kingdom
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> It's a surprise to me to see that someone has different positions in a very short period of time in the community.
>>
>> Mich cast +1 for SPARK-44444 and -1 for SPARK-46122.
>> - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc
>> - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p
>>
>> To Mich, what I'm interested in specifically is the following:
>> > 2. Compatibility: Changing the default behavior could potentially break existing workflows or pipelines that rely on the current behavior.
>>
>> May I ask you the following questions?
>> A. What is the purpose of the migration guide in the ASF projects?
>> B. Do you claim that there is incompatibility when you have spark.sql.legacy.createHiveTableByDefault=true, which is described in the migration guide?
>> C. Do you know that ANSI SQL has new RUNTIME exceptions which are harder than SPARK-46122?
>> D. Or did you cast +1 for SPARK-44444 because you think there is no breaking change by default?
>>
>> I guess there is some misunderstanding of the proposal.
>>
>> Thanks,
>> Dongjoon.
>>
>> On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to add a side note regarding the discussion process and the current title of the proposal. The title '[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific configuration parameter, which might lead some participants to overlook its broader implications (as was raised by myself and others). I believe that a more descriptive title, encompassing the broader discussion on default behaviours for creating Hive tables in Spark SQL, could enable greater engagement within the community. This is an important topic that deserves thorough consideration.
>>>
>>> HTH
>>>
>>> Mich
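Dongjoon's question B above refers to the documented escape hatch: per the migration guide, the legacy flag can be turned back on. A minimal sketch of what that looks like in practice, assuming a local session with Hive support on the classpath (the table name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport() // assumes Hive classes are on the classpath
  .getOrCreate()

// Restore the legacy mapping for this session: a plain CREATE TABLE
// (no USING / STORED AS clause) is treated as a Hive table again.
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
spark.sql("CREATE TABLE demo_t (id INT, name STRING)")
```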
>>> On Fri, 26 Apr 2024 at 07:13, L. C. Hsieh <vii...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>
>>>>>> Of course, I can't think of a scenario of thousands of tables with a single in-memory Spark cluster and an in-memory catalog.
>>>>>> Thanks for the help!
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Agreed. In scenarios where most of the interactions with the catalog are related to query planning, saving and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal compared to the time spent on actual data fetching, processing and computation.
>>>>>>>
>>>>>>> However, we should consider scalability and reliability concerns, especially as the size and complexity of the data and the query workload grow. While an in-memory catalog may offer excellent performance for smaller workloads, it will face limitations in handling larger-scale deployments with thousands of tables, partitions and users. Additionally, durability and persistence are crucial considerations, particularly in production environments where data integrity and availability matter. In-memory catalog implementations may lack durability, meaning that metadata changes could be lost in the event of a system failure or restart. Therefore, while in-memory catalog implementations can provide speed and efficiency for certain use cases, we ought to weigh the requirements for scalability, reliability and data durability when choosing a catalog solution for production deployments. In many cases, a combination of in-memory and disk-based catalog solutions may offer the best balance of performance and resilience for demanding large-scale workloads.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich
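The trade-off Mich describes maps directly to a Spark setting. A minimal sketch, assuming a local session (the catalog implementation is fixed per application, so pick one value per job):

```scala
import org.apache.spark.sql.SparkSession

// "in-memory" keeps all catalog metadata in the driver (fast, but lost on
// restart), while "hive" persists it in a metastore (embedded Derby by
// default, or an external RDBMS in production).
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalogImplementation", "in-memory") // or "hive"
  .getOrCreate()

// With the in-memory catalog, this table's metadata lives only as long
// as the application; the data files themselves persist in the warehouse.
spark.sql("CREATE TABLE scratch (id INT) USING parquet")
println(spark.conf.get("spark.sql.catalogImplementation")) // confirm the setting
```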
>>>>>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Of course, but it's in memory and not persisted, which is much faster. And as I said, I believe that most of the interaction with it happens during planning and save, not during actual query run operations, and those interactions are short and minimal compared to data fetching and manipulation, so I don't believe it will have a big impact on query run...
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Well, I would be surprised, because the Derby database is single-threaded and won't be of much use here.
>>>>>>>>>
>>>>>>>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle for the metastore; these are battle-proven, replicated and backed up.
>>>>>>>>>
>>>>>>>>> Mich
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>>>>>> And again, I presume that most metadata-related work happens during planning and not the actual run, so I don't see why it should strongly affect query performance.
>>>>>>>>>>
>>>>>>>>>> Thanks,
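For context on the backend being debated here: which database sits behind the Hive metastore is controlled by standard Hive JDBC properties, usually kept in hive-site.xml. A sketch of pointing the metastore at PostgreSQL instead of the embedded Derby default, assuming the common pattern of passing these through Spark's Hadoop configuration; the host, database name and credentials are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a PostgreSQL-backed metastore instead of the single-threaded
// embedded Derby default. These are standard Hive metastore properties;
// the values below are placeholders, not a real deployment.
val spark = SparkSession.builder()
  .config("spark.sql.catalogImplementation", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:postgresql://metastore-host:5432/hive_metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "********")
  .enableHiveSupport()
  .getOrCreate()
```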
>>>>>>>>>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> With regard to your point below:
>>>>>>>>>>>
>>>>>>>>>>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>>>>>>>>>>
>>>>>>>>>>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>>>>>>>>>>
>>>>>>>>>>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>>>>>>>>>>
>>>>>>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Mich
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>>>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>>>>>>>>>>>> Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of the Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase and YARN. If you are Hadoop-centric (say on-premise), using Hive may offer better compatibility and interoperability.
>>>>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>>>>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>>>>>>>>>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mich
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I will also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nimrod
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please elaborate on the above, specifically with regard to the phrase "... because we support better"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to integration with Spark?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mich
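To make the native-versus-Hive table distinction in the exchange above concrete: the provider of each table is recorded in the catalog and can be inspected. A minimal sketch, assuming a Hive-enabled local session (table names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// A Spark native (data source) table and a Hive table, created explicitly:
spark.sql("CREATE TABLE native_t (id INT) USING parquet")
spark.sql("CREATE TABLE hive_t (id INT) STORED AS PARQUET")

// DESCRIBE TABLE EXTENDED exposes the provider in the catalog metadata:
// native_t should report its provider as parquet, hive_t as hive.
spark.sql("DESCRIBE TABLE EXTENDED native_t").show(50, truncate = false)
spark.sql("DESCRIBE TABLE EXTENDED hive_t").show(50, truncate = false)
```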
>>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Historically, this behavior change was merged once during Apache Spark 3.0.0 preparation via SPARK-30098, but reverted during the 3.0.0 RC period.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as one of the legacy behaviors behind this configuration, via the reused ID SPARK-30098.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Dongjoon.
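To illustrate the proposed default concretely: a minimal sketch of the syntax in question, assuming a Hive-enabled local session (table names are illustrative). With the flag at its proposed value of `false`, a bare CREATE TABLE picks up its provider from `spark.sql.sources.default` (parquet unless overridden); explicit USING or STORED AS clauses behave the same either way:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Proposed Spark 4.0.0 default; set "true" to restore the legacy mapping.
  .config("spark.sql.legacy.createHiveTableByDefault", "false")
  .enableHiveSupport()
  .getOrCreate()

// No USING / STORED AS: the provider comes from spark.sql.sources.default
// (parquet unless overridden) instead of defaulting to Hive.
spark.sql("CREATE TABLE t1 (id INT, name STRING)")

// Explicit syntax is unaffected by the flag:
spark.sql("CREATE TABLE t2 (id INT) USING orc")     // Spark native table
spark.sql("CREATE TABLE t3 (id INT) STORED AS ORC") // Hive table
```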