Anything would be OK as long as the CREATE TABLE DDL gives end users a "clear way" to know the table provider "before" they run the query. Great news that it doesn't require major rework - looking forward to the PR.
Thanks again for jumping in and sorting this out.

Jungtaek Lim (HeartSaVioR)

On Fri, Mar 20, 2020 at 9:36 AM Ryan Blue <rb...@netflix.com> wrote:

> I have an update to the parser that unifies the CREATE TABLE rules. It took surprisingly little work to get the parser updated to produce CreateTableStatement and CreateTableAsSelectStatement with the Hive info. And the only fields I needed to add to those statements were serde: SerdeInfo and external: Boolean.

> From there, we can use the conversion rules to re-create the same Hive command for v1 or pass the data as properties for v2. I'll work on getting this cleaned up and open a PR hopefully tomorrow.

> For the questions about how this gets converted to either a Spark or Hive create table command, that is really up to analyzer rules and configuration. With my changes, it is no longer determined by the parser: the parser just produces a node that includes all of the user options, and Spark decides what to do with that in the analyzer. Also, there's already an option to convert Hive syntax to a Spark command, spark.sql.hive.convertCTAS.

> rb

> On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan <cloud0...@gmail.com> wrote:

>> Big +1 to have one single unified CREATE TABLE syntax.

>> In general, we can say there are 2 ways to specify the table provider: the USING clause and the ROW FORMAT/STORED AS clause. These 2 ways are mutually exclusive. If neither is specified, it implicitly indicates USING defaultSource.

>> I'm fine with a few special cases which can indicate the table provider, like EXTERNAL indicating a Hive Serde table. A few thoughts:
>> 1. SKEWED BY ...: We support it in the Hive syntax just to fail it with a nice error message. We can support it in the unified syntax as well, and fail it.
>> 2. PARTITIONED BY colTypeList: I think we can support it in the unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList.
>> 3. OPTIONS: We can either map it to Hive Serde properties, or let it indicate non-Hive tables.

>> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

>>> Thanks Nicholas for the side comment; you'll need to interpret "CREATE TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes, it may add to the confusion.

>>> Ryan, thanks for the detailed analysis and proposal. That's what I would like to see in a discussion thread.

>>> I'm open to solutions which enable end users to specify their intention properly - my main concern with SPARK-30098 is that it becomes unclear which provider a CREATE TABLE query will use unless USING provider is explicitly specified. If the new proposal makes this clear, that should be better than now.

>>> Replying inline:

>>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

>>>> Side comment: The current docs for CREATE TABLE <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description> add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" is actually part of the syntax <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>.
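>>>> To illustrate the misreading (a hypothetical sketch; the page itself shows the real grammar), a reader could take the description literally and try:

>>>>     CREATE TABLE t (a INT) USING HIVE FORMAT;   -- not actual syntax

>>>> when what the page actually documents is the Hive-style form, e.g.:

>>>>     CREATE TABLE t (a INT) STORED AS PARQUET;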
>>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

>>>>> Jungtaek, it sounds like you consider the two rules to be separate syntaxes with their own consistency rules. For example, if I am using the Hive syntax rule, then the PARTITIONED BY clause adds new (partition) columns and requires types for those columns; if I'm using the Spark syntax rule with USING, then PARTITIONED BY must reference existing columns and cannot include types.

>>>>> I agree that this is confusing to users! We should fix it, but I don't think the right solution is to continue to have two rules with divergent syntax.

>>>>> This is confusing to users because they don't know anything about separate parser rules. All the user sees is that sometimes PARTITIONED BY requires types and sometimes it doesn't. Yes, we could add a keyword, HIVE, to signal that the syntax is borrowed from Hive for that case, but that actually breaks queries that run in Hive.

>>> That might matter less, because SPARK-30098 (and I guess your proposal as well) forces end users to add "USING HIVE" to their queries to enable the Hive provider anyway, even if only when the query matches rule 1 (conditional). Once they decide to create a Hive table, the query may have to be changed, or they have to change the default provider, or they have to enable the legacy config.

>>>>> I think the right solution is to unify the two syntaxes. I don't think they are so different that it isn't possible. Here are the differences I see:

>>>>> - Only in Hive:
>>>>>   - EXTERNAL
>>>>>   - skewSpec: SKEWED BY ...
>>>>>   - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>>>   - createFileFormat: STORED AS ...
>>>>> - Only in Spark:
>>>>>   - OPTIONS property list
>>>>> - Different syntax/interpretation:
>>>>>   - PARTITIONED BY transformList / PARTITIONED BY colTypeList

>>> ":" after a column name is another one only supported in Hive, though that's relatively minor to support in a unified syntax.

>>>>> For the clauses that are supported in one but not the other, we can add them to a unified rule as optional clauses. The AST builder would then validate what makes sense or not (e.g., STORED AS with USING, or ROW FORMAT DELIMITED) and finally pass the remaining data on using the CreateTableStatement. That statement would be handled like we do for the Spark rule today, but with extra metadata to pass along. This is also a step toward being able to put Hive behind the DSv2 API, because we'd be able to pass all of the Hive metadata clauses to the v2 catalog.

>>>>> The only difficult part is handling PARTITIONED BY. But in that case, we can use two different syntaxes from the same CREATE TABLE rule. If types are included, we use the Hive PARTITIONED BY syntax and convert in the AST builder to normalize to a single representation.

>>> The proposal looks promising - it may add some complexity, but it sounds worth it.

>>> One thing to make clear - in the unified syntax we only rely on the explicit provider, or the default provider, right? I would be concerned if the proposal automatically used the Hive provider when Hive-specific clauses are present. Yes, as I said earlier, it may require end users' queries to be changed, but that is better than uncertainty.
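>>> To make the PARTITIONED BY difference concrete, a hypothetical pair of examples (assuming parquet as the provider):

>>>     -- Hive rule: PARTITIONED BY introduces a new column, so a type is required
>>>     CREATE TABLE t1 (a INT) PARTITIONED BY (dt STRING) STORED AS PARQUET;

>>>     -- Spark rule: PARTITIONED BY references an existing column; no type allowed
>>>     CREATE TABLE t2 (a INT, dt STRING) USING parquet PARTITIONED BY (dt);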
>>> Btw, if the main purpose of adding the native syntax and changing the default is to discontinue support for the Hive create table rule sooner, simply dropping rule 2 while providing a legacy config is still a valid option, I think.

>>>>> What do you both think? This would make the behavior more clear and take a step toward getting rid of Hive-specific code.

>>>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

>>>>>> I'm trying to understand why you have been suggesting keeping the real behavior unchanged but changing the doc instead. Could you please elaborate? End users will blame us when they hit a case where their query doesn't work as intended (1), find that the behavior is undocumented (2), and find it hard to understand even from the Spark codebase (3).

>>>>>> For me, addressing the root issue while adopting your suggestion would mean "dropping rule 2" and only supporting it with the legacy config on. We would say to end users: you need to enable the legacy config to leverage the Hive create table syntax, or just use beeline with Hive connected.

>>>>>> But since we are even thinking about the native syntax as first class and dropping the Hive one implicitly (hiding it in the doc) or explicitly, does it really matter if we require a marker (like "HIVE") in rule 2 and isolate it? It would cause even less confusion than Spark 2.x, since we would require end users to provide the Hive marker when creating a Hive table, which is easier to classify than "USING provider".

>>>>>> If we think the native syntax will cover many of the cases where end users have been creating Hive tables in Spark (say, USING hive would simply work for them), I'm OK with dropping rule 2 and leading end users to enable the legacy config if really needed. If not, let's continue "fixing" the issue.

>>>>>> (Another valid approach would be consolidating the two rules into one, and defining the supported parameters per provider, e.g. EXTERNAL, STORED AS, ROW FORMAT, etc. are only supported by the Hive provider.)

>>>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cloud0...@gmail.com> wrote:

>>>>>>> The fact that we have 2 CREATE TABLE syntaxes is already confusing many users. Shall we only document the native syntax? Then users don't need to worry about which rule their query fits, and they don't need to spend a lot of time understanding the subtle differences between these 2 syntaxes.

>>>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

>>>>>>>> A bit of a correction: the example I provided for "vice versa" is not really a case of vice versa. It's actually the same case (intended to use rule 2, which is not the default) but with a different result.
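>>>>>>>> That is, a hypothetical example (assuming parquet as the default provider):

>>>>>>>>     -- Spark 2.4: rule 2 is the default, so this creates a Hive table
>>>>>>>>     -- Spark 3.0: it silently binds to rule 1 and creates a parquet table
>>>>>>>>     CREATE TABLE t (a INT, b STRING);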
>>>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

>>>>>>>>> My concern is that although we think of the change as simply marking "USING provider" optional in rule 1, in reality the change most likely swaps the default rule for CREATE TABLE, which was "rule 2" and is now "rule 1" (the happy case in the migration doc is when the swap happens as intended), and there are still a couple of things which make a query fall into rule 2, which is non-trivial to reason about and also not easy to explain.

>>>>>>>>> I only mentioned ROW FORMAT and STORED AS as the cases that fall into rule 2 to simplify the problem statement, but they're not the only ones - using col_name1:col_type1 makes the query fall into rule 2 regardless of any other properties.

>>>>>>>>> What's the matter? In Spark 2.x, if end users wanted to use rule 1 (which was not the default) and specified the parameters which are only available in rule 1, it clearly required "USING provider" - the parser would throw an error if there were any mistakes in specifying parameters. End users could state clearly which rule the query should be bound to. If the query failed to bind to the rule as intended, it was simply denied.

>>>>>>>>> In Spark 3.x, the parser may not help in the case where end users intend to use rule 2 (which is not the default) but make some mistake in specifying parameters - the query could just "silently" bind to rule 1 and may even be executed without any error. Vice versa happens too, but in an odd way - e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message "CREATE EXTERNAL TABLE is not supported".

>>>>>>>>> It's deterministic for end users only if they fully understand the difference between the two and also understand how Spark applies the rules to make a query fall into one of them. I'm not sure how we can improve only the documentation to make things clear, but if the approach is to explain the difference between the rules and give tips for making a query bind to a specific rule, the same could be applied to the parser rules to address the root cause.

>>>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:

>>>>>>>>>> Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both the native and the Hive syntax. I'm fine with some changes to make it less confusing, as long as the user-facing behavior doesn't change. For example, define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config is false.

>>>>>>>>>> I still don't get your point about what the real problem is for end users. There is no ambiguity, as the behavior is deterministic, although we rely on optional fields and rule order, which is bad. It's hard to document, but I don't think that's a big problem for end users.
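>>>>>>>>>> As a hypothetical sketch of the escape hatch (using the config name from the migration guide, quoted later in this thread):

>>>>>>>>>>     SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
>>>>>>>>>>     -- with the legacy config on, this creates a Hive table as in 2.4
>>>>>>>>>>     CREATE TABLE t (a INT);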
>>>>>>>>>> For the legacy config, it does make the implementation more complicated, but it's invisible to most end users (we don't document it) and can be super useful to some users who want their queries to keep working in 3.0 without rewriting.

>>>>>>>>>> If your only concern is documentation, I totally agree that we should improve it.

>>>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

>>>>>>>>>>> Thanks for sharing your view.

>>>>>>>>>>> I agree with you that it's good for Spark to promote Spark's own CREATE TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax unchanged - it's described as a "convenience", but I'm not sure I can agree with that.

>>>>>>>>>>> I'll quote my comments from SPARK-31136 here again to make the problem statement clearer:

>>>>>>>>>>> I think the parser implementation around CREATE TABLE brings ambiguity which is not documented anywhere. It wasn't ambiguous before, because we forced users to specify USING provider if it's not a Hive table. Now it's either the default provider or Hive according to which options are provided, which seems to be non-trivial to reason about. (End users would never know, as it comes entirely from the parser rules.)

>>>>>>>>>>> I feel this is the issue of "not breaking old behavior". The parser rules get pretty complicated due to supporting the legacy config. Not breaking anything will make us stuck eventually.

>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md

>>>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the value of spark.sql.sources.default as its provider. In Spark version 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.

>>>>>>>>>>> It's not true if "ROW FORMAT" / "STORED AS" is provided, and we didn't describe anything about this.

>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md

>>>>>>>>>>>     CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>>         [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>>         [ USING data_source ]
>>>>>>>>>>>         [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>         [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>>>>>         [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>>>>>             [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>>>>>             INTO num_buckets BUCKETS ]
>>>>>>>>>>>         [ LOCATION path ]
>>>>>>>>>>>         [ COMMENT table_comment ]
>>>>>>>>>>>         [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>         [ AS select_statement ]

>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md

>>>>>>>>>>>     CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>>         [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>>         [ COMMENT table_comment ]
>>>>>>>>>>>         [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>>>>>             | ( col_name1, col_name2, ... ) ]
>>>>>>>>>>>         [ ROW FORMAT row_format ]
>>>>>>>>>>>         [ STORED AS file_format ]
>>>>>>>>>>>         [ LOCATION path ]
>>>>>>>>>>>         [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>         [ AS select_statement ]
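>>>>>>>>>>> As a hypothetical illustration of the gotcha (assuming parquet as the default provider), one "optional" clause switches which rule matches:

>>>>>>>>>>>     -- matches the first (datasource) rule: creates a parquet table
>>>>>>>>>>>     CREATE TABLE t (a INT) LOCATION '/tmp/t';

>>>>>>>>>>>     -- STORED AS silently moves the query to the second (Hive) rule
>>>>>>>>>>>     CREATE TABLE t (a INT) STORED AS ORC LOCATION '/tmp/t';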
>>>>>>>>>>> At least we should describe that the parser will try to match the first case (create table ~ using data source) and fall back to the second case; even if we describe this, it's not intuitive to reason about which rule a DDL query will fall into. As I commented earlier, "ROW FORMAT" and "STORED AS" are the options which make a DDL query fall into the second case, but they're described as "optional", so it's hard to catch the gotcha.

>>>>>>>>>>> Furthermore, while we document the syntax as above, in reality we allow "EXTERNAL" in the first rule (and throw an error), which ends up breaking the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even if we add "USING hive", the parser will throw an error. It now requires "ROW FORMAT" or "STORED AS".

>>>>>>>>>>> Simply put, do we really think end users should stop and try to match their query against the parser rules (or their order, when we explain them in the doc) by themselves to understand which provider the table will use? I'm sorry, but I think we are making a bad assumption about end users, which is a serious problem.

>>>>>>>>>>> If we really want to promote Spark's own CREATE TABLE, then would it really matter to treat the Hive CREATE TABLE as the "exceptional" one and isolate the two from each other? What's the point of providing a legacy config to go back to the old behavior if we fear breaking something while making it better or clearer? We do think the table provider is important (hence the change was made), so is it still a trivial problem that the provider is affected by specifying "optional" fields?

>>>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cloud0...@gmail.com> wrote:

>>>>>>>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive, because the native syntax required the USING clause while the Hive syntax made the ROW FORMAT or STORED AS clause optional.

>>>>>>>>>>>> It's a good move to make the USING clause optional, which makes it easier to write the native CREATE TABLE syntax. Unfortunately, it leads to some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious problem here. If a user just writes CREATE TABLE without USING or ROW FORMAT or STORED AS, does it matter what table we create? Internally the parser rules conflict and we pick the native syntax depending on the rule order. But the user-facing behavior looks fine.
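>>>>>>>>>>>> To sketch the one conflict that is user-visible (a hypothetical example, per SPARK-30436):

>>>>>>>>>>>>     -- Spark 2.4: creates an external Hive table
>>>>>>>>>>>>     -- Spark 3.0: rejected unless ROW FORMAT / STORED AS is added
>>>>>>>>>>>>     CREATE EXTERNAL TABLE t (a INT) LOCATION '/tmp/t';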
>>>>>>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then CREATE EXTERNAL TABLE creates a Hive table, like 2.4.

>>>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

>>>>>>>>>>>>> Hi devs,

>>>>>>>>>>>>> I'd like to initiate a discussion and hear your voices on resolving the ambiguity between the two "create table" parser rules brought about by SPARK-30098 [1].

>>>>>>>>>>>>> Previously, the "create table" parser rules were clearly distinguished via "USING provider", which was very intuitive and deterministic. Say, a DDL query creates a "Hive" table unless "USING provider" is specified. (Please refer to the parser rules in branch-2.4 [2].)

>>>>>>>>>>>>> After SPARK-30098, the "create table" parser rules became ambiguous (please refer to the parser rules in branch-3.0 [3]) - the only factors differentiating the two rules are "ROW FORMAT" and "STORED AS", which are both defined as "optional". Now it relies on the "order" of the parser rules, which end users have no way to reason about, and which is very unintuitive.

>>>>>>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the first rule to provide a better message) brought more confusion (I've described the broken existing query in SPARK-30436 [4]).

>>>>>>>>>>>>> Personally, I'd like to see the two rules made mutually exclusive, instead of trying to document the difference and telling end users to be careful about their queries. I see two ways to make the rules mutually exclusive (both sketched below the links):

>>>>>>>>>>>>> 1. Add some identifier to the create Hive table rule, like `CREATE ... "HIVE" TABLE ...`.

>>>>>>>>>>>>> pros: This is the simplest way to distinguish between the two rules.
>>>>>>>>>>>>> cons: This would force end users to change their query if they intend to create a Hive table. (Given that we would also provide a legacy option, I feel this is acceptable.)

>>>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.

>>>>>>>>>>>>> pros: Less invasive for existing queries.
>>>>>>>>>>>>> cons: Less intuitive, because these clauses have been optional and would now become mandatory to fall into the second rule.

>>>>>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!

>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)

>>>>>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
>>>>>>>>>>>>>    https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>>>>>> 2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>>> 3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
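>>>>>>>>>>>>> As promised above, a rough sketch of both options (hypothetical syntax):

>>>>>>>>>>>>>     -- option 1: an explicit marker selects the Hive rule
>>>>>>>>>>>>>     CREATE HIVE TABLE t (a INT) STORED AS ORC;

>>>>>>>>>>>>>     -- option 2: ROW FORMAT / STORED AS becomes mandatory to select the Hive rule
>>>>>>>>>>>>>     CREATE TABLE t (a INT) STORED AS ORC;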
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix

> --
> Ryan Blue
> Software Engineer
> Netflix