Anything would be OK as long as the CREATE TABLE DDL provides a clear way to
determine the table provider "before" the query runs. Great news that it
doesn't require major rework - looking forward to the PR.

Thanks again for jumping in and sorting this out.

- Jungtaek Lim (HeartSaVioR)

On Fri, Mar 20, 2020 at 9:36 AM Ryan Blue <rb...@netflix.com> wrote:

> I have an update to the parser that unifies the CREATE TABLE rules. It
> took surprisingly little work to get the parser updated to produce
> CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
> And the only fields I need to add to those statements were serde:
> SerdeInfo and external: Boolean.
>
> From there, we can use the conversion rules to re-create the same Hive
> command for v1 or pass the data as properties for v2. I’ll work on getting
> this cleaned up and open a PR hopefully tomorrow.
>
> For the questions about how this gets converted to either a Spark or Hive
> create table command, that is really up to analyzer rules and
> configuration. With my changes, it is no longer determined by the parser:
> the parser just produces a node that includes all of the user options and
> Spark decides what to do with that in the analyzer. Also, there's already
> an option to convert Hive syntax to a Spark
> command, spark.sql.hive.convertCTAS.
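>
> For instance (a sketch; the table name is made up, and whether the
> conversion actually kicks in depends on the serde and file format):
>
>   SET spark.sql.hive.convertCTAS=true;
>   -- a Hive-syntax CTAS like this may then use the default data source:
>   CREATE TABLE t STORED AS parquet AS SELECT 1 AS id;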
>
> rb
>
> On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Big +1 for having a single, unified CREATE TABLE syntax.
>>
>> In general, we can say there are 2 ways to specify the table provider: the
>> USING clause and the ROW FORMAT/STORED AS clauses. These 2 ways are mutually
>> exclusive. If neither is specified, it implicitly indicates USING
>> defaultSource.
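>>
>> For illustration (a sketch; table names are made up):
>>
>>   CREATE TABLE t1 (id INT) USING parquet;  -- native syntax
>>   CREATE TABLE t2 (id INT) STORED AS orc;  -- Hive syntax
>>   CREATE TABLE t3 (id INT);                -- implicitly USING defaultSource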
>>
>> I'm fine with a few special cases that can indicate the table provider,
>> like EXTERNAL indicating a Hive SerDe table. A few thoughts:
>> 1. SKEWED BY ...: We support it in the Hive syntax just to fail with a
>> nice error message. We can support it in the unified syntax as well, and
>> fail it there too.
>> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
>> syntax. Just make sure it doesn't appear together with PARTITIONED BY
>> transformList (see the sketch below).
>> 3. OPTIONS: We can either map it to Hive SerDe properties, or let it
>> indicate non-Hive tables.
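>>
>> A sketch for point 2 (table names are made up):
>>
>>   -- colTypeList form: dt is a new partition column with an explicit type
>>   CREATE TABLE t1 (id INT) PARTITIONED BY (dt STRING);
>>   -- transformList form: partitions by a transform over an existing column
>>   CREATE TABLE t2 (id INT, ts TIMESTAMP) PARTITIONED BY (days(ts));
>>   -- mixing the two forms in one clause should be rejected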
>>
>> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks Nicholas for the side comment; you'd need to interpret "CREATE
>>> TABLE USING HIVE FORMAT" as CREATE TABLE using the "HIVE FORMAT", but yes,
>>> it may add to the confusion.
>>>
>>> Ryan, thanks for the detailed analysis and proposal. That's exactly what
>>> I'd like to see in a discussion thread.
>>>
>>> I'm open to any solution that enables end users to specify their intention
>>> properly - my main concern with SPARK-30098 is that it becomes unclear
>>> which provider a CREATE TABLE query will use unless USING provider is
>>> explicitly specified. If the new proposal makes this clear, it will be
>>> better than what we have now.
>>>
>>> Replying inline:
>>>
>>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Side comment: The current docs for CREATE TABLE
>>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>>> actually part of the syntax
>>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>
>>>> .
>>>>
>>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Jungtaek, it sounds like you consider the two rules to be separate
>>>>> syntaxes with their own consistency rules. For example, if I am using the
>>>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>>>> columns and requires types for those columns; if I’m using the Spark 
>>>>> syntax
>>>>> rule with USING then PARTITIONED BY must reference existing columns
>>>>> and cannot include types.
>>>>>
>>>>> I agree that this is confusing to users! We should fix it, but I don’t
>>>>> think the right solution is to continue to have two rules with divergent
>>>>> syntax.
>>>>>
>>>>> This is confusing to users because they don’t know anything about
>>>>> separate parser rules. All the user sees is that sometimes PARTITION
>>>>> BY requires types and sometimes it doesn’t. Yes, we could add a
>>>>> keyword, HIVE, to signal that the syntax is borrowed from Hive for
>>>>> that case, but that actually breaks queries that run in Hive.
>>>>>
>>> That might matter less, because SPARK-30098 (and I guess your proposal
>>> as well) forces end users to add "USING HIVE" to their queries to enable
>>> the Hive provider in any case, even when the query only matches rule 1
>>> (conditional). Once they decide to create a Hive table, the query might
>>> have to be changed, or they have to change the default provider, or they
>>> have to enable the legacy config.
>>>
>>>
>>>>> I think the right solution is to unify the two syntaxes. I don’t think
>>>>> they are so different that it isn’t possible. Here are the differences I
>>>>> see:
>>>>>
>>>>>    - Only in Hive:
>>>>>       - EXTERNAL
>>>>>       - skewSpec: SKEWED BY ...
>>>>>       - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>>>       - createFileFormat: STORED AS ...
>>>>>    - Only in Spark:
>>>>>       - OPTIONS property list
>>>>>    - Different syntax/interpretation:
>>>>>       - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>>>
>>>>> ":" after column name is another one only supported in Hive, though
>>> that's relatively minor to support it in unified syntax.
>>>
>>>>
>>>>>
>>>>> For the clauses that are supported in one but not the other, we can
>>>>> add them to a unified rule as optional clauses. The AST builder would then
>>>>> validate what makes sense or not (e.g., stored as with using or row format
>>>>> delimited) and finally pass the remaining data on using the
>>>>> CreateTableStatement. That statement would be handled like we do for
>>>>> the Spark rule today, but with extra metadata to pass along. This is also 
>>>>> a
>>>>> step toward being able to put Hive behind the DSv2 API because we’d be 
>>>>> able
>>>>> to pass all of the Hive metadata clauses to the v2 catalog.
>>>>>
>>>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>>>> in the AST builder to normalize to a single representation.
>>>>>
>>> The proposal looks promising - it may add some complexity, but it sounds
>>> worth doing.
>>>
>>> One thing to make clear - in the unified syntax we only rely on an explicit
>>> provider, or the default provider, right? I would be concerned if the
>>> proposal automatically used the Hive provider when Hive-specific clauses
>>> are present. Yes, as I said earlier, it may force end users' queries to
>>> change, but that's better than uncertainty.
>>>
>>> Btw, if the main purpose of adding the native syntax and making it the
>>> default is to discontinue the Hive create table rule sooner, simply
>>> dropping rule 2 and providing a legacy config is still a valid option, I
>>> think.
>>>
>>>
>>>> What do you both think? This would make the behavior more clear and
>>>>> take a step toward getting rid of Hive-specific code.
>>>>>
>>>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> I'm trying to understand why you have been suggesting we keep the
>>>>>> actual behavior unchanged and change only the docs. Could you please
>>>>>> elaborate? End users will blame us when they hit a case where their query
>>>>>> doesn't work as intended (1), find out that it's undocumented (2), and
>>>>>> find it hard to understand even from the Spark codebase (3).
>>>>>>
>>>>>> For me, addressing the root issue while adopting your suggestion would
>>>>>> mean "dropping rule 2" and only supporting it with the legacy config on.
>>>>>> We would tell end users: enable the legacy config to use the Hive create
>>>>>> table syntax, or just use beeline connected to Hive.
>>>>>>
>>>>>> But since we are even thinking about treating the native syntax as first
>>>>>> class and dropping the Hive one, either implicitly (hiding it in the
>>>>>> docs) or explicitly, does it really matter if we require a marker (like
>>>>>> "HIVE") in rule 2 to isolate it? It would cause even less confusion than
>>>>>> Spark 2.x, since we would require end users to provide the Hive marker
>>>>>> when creating a Hive table, which is easier to classify than "USING
>>>>>> provider".
>>>>>>
>>>>>> If we think the native syntax would cover most cases where end users
>>>>>> have been creating Hive tables in Spark (say, USING hive would simply
>>>>>> work for them), I'm OK with dropping rule 2 and leading end users to
>>>>>> enable the legacy config if really needed. If not, let's continue
>>>>>> "fixing" the issue.
>>>>>>
>>>>>> (Another valid approach would be consolidating the two rules into one
>>>>>> and defining which parameters each provider supports, e.g. EXTERNAL,
>>>>>> STORED AS, ROW FORMAT, etc. are only supported by the Hive provider.)
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The fact that we have 2 CREATE TABLE syntaxes is already confusing
>>>>>>> many users. Shall we only document the native syntax? Then users don't
>>>>>>> need to worry about which rule their query fits, and they don't need to
>>>>>>> spend a lot of time understanding the subtle differences between these 2
>>>>>>> syntaxes.
>>>>>>>
>>>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> A small correction: the example I provided for the vice-versa case is
>>>>>>>> not really a vice-versa case. It's actually the same case (intending to
>>>>>>>> use rule 2, which is not the default) but with a different result.
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <
>>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My concern is that while we think of the change as simply making
>>>>>>>>> "USING provider" optional in rule 1, in reality the change most likely
>>>>>>>>> swaps the default rule for CREATE TABLE from "rule 2" to "rule 1"
>>>>>>>>> (that would be the happy case in the migration doc, if the swap
>>>>>>>>> happens as intended), and there are still a couple of things that make
>>>>>>>>> a query fall into rule 2, which is non-trivial to reason about and
>>>>>>>>> also not easy to explain.
>>>>>>>>>
>>>>>>>>> I only mentioned ROW FORMAT and STORED AS as the clauses that make a
>>>>>>>>> query fall into rule 2 to simplify the problem statement, but they're
>>>>>>>>> not the only ones - using col_name1:col_type1 makes the query fall
>>>>>>>>> into rule 2 regardless of any other properties, as in the sketch
>>>>>>>>> below.
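>>>>>>>>>
>>>>>>>>> For example (the table name is made up):
>>>>>>>>>
>>>>>>>>>   CREATE TABLE t (id INT);   -- matches rule 1, default provider
>>>>>>>>>   CREATE TABLE t (id: INT);  -- the ":" only parses in rule 2, so Hive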
>>>>>>>>>
>>>>>>>>> Why does this matter? In Spark 2.x, if end users wanted to use rule 1
>>>>>>>>> (which was not the default) and specify parameters only available in
>>>>>>>>> rule 1, the query clearly required "USING provider" - the parser would
>>>>>>>>> throw an error on any mistake in the parameters. End users could state
>>>>>>>>> clearly which rule the query should be bound to. If the query failed
>>>>>>>>> to bind to the intended rule, it was simply rejected.
>>>>>>>>>
>>>>>>>>> In Spark 3.x, the parser may not help in the case where end users
>>>>>>>>> intend to use rule 2 (which is not the default) but make a mistake in
>>>>>>>>> the parameters - the query could just be "silently" bound to rule 1
>>>>>>>>> and may even execute without any error. The vice-versa case also
>>>>>>>>> happens, but in an odd way - e.g. CREATE EXTERNAL TABLE ... LOCATION
>>>>>>>>> fails with the weird message that CREATE EXTERNAL TABLE is not
>>>>>>>>> supported.
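>>>>>>>>>
>>>>>>>>> For instance (a sketch; names and paths are made up):
>>>>>>>>>
>>>>>>>>>   -- in 2.4 this created a Hive table; in 3.0 it silently matches
>>>>>>>>>   -- rule 1 and uses spark.sql.sources.default instead:
>>>>>>>>>   CREATE TABLE t (id INT);
>>>>>>>>>   -- the odd failure: this matches rule 1, which rejects EXTERNAL:
>>>>>>>>>   CREATE EXTERNAL TABLE t (id INT) LOCATION '/tmp/t';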
>>>>>>>>>
>>>>>>>>> It's deterministic for end users only if they fully understand the
>>>>>>>>> difference between the two rules and also understand how Spark applies
>>>>>>>>> them to make a query fall into one of them. I'm not sure how improving
>>>>>>>>> documentation alone can make things clear, but if the approach is to
>>>>>>>>> explain the difference between the rules and give tips for binding a
>>>>>>>>> query to a specific rule, the same effort could be applied to the
>>>>>>>>> parser rules to address the root cause.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Document-wise, yes, it's confusing that a simple CREATE TABLE fits
>>>>>>>>>> both the native and Hive syntax. I'm fine with some changes to make
>>>>>>>>>> it less confusing, as long as the user-facing behavior doesn't
>>>>>>>>>> change. For example, define "ROW FORMAT" or "STORED AS" as mandatory
>>>>>>>>>> only if the legacy config is false.
>>>>>>>>>>
>>>>>>>>>> I still don't get your point about what the real problem is for end
>>>>>>>>>> users. There is no ambiguity, as the behavior is deterministic,
>>>>>>>>>> although we rely on optional fields and rule order, which is bad.
>>>>>>>>>> It's hard to document, but I don't think that's a big problem for end
>>>>>>>>>> users.
>>>>>>>>>>
>>>>>>>>>> For the legacy config, it does make the implementation more
>>>>>>>>>> complicated, but it's invisible to most end users (we don't document
>>>>>>>>>> it) and can be super useful to users who want their queries to keep
>>>>>>>>>> working in 3.0 without rewriting.
>>>>>>>>>>
>>>>>>>>>> If your only concern is documentation, I totally agree that we
>>>>>>>>>> should improve it.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>>>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>>>
>>>>>>>>>>> I agree with you that it's good for Spark to promote its own
>>>>>>>>>>> CREATE TABLE syntax. The thing is, we still leave the Hive CREATE
>>>>>>>>>>> TABLE syntax unchanged - it's described as a "convenience", but I'm
>>>>>>>>>>> not sure I can agree with that.
>>>>>>>>>>>
>>>>>>>>>>> I'll quote my comments from SPARK-31136 here again to make the
>>>>>>>>>>> problem statement clearer:
>>>>>>>>>>>
>>>>>>>>>>> I think the parser implementation around CREATE TABLE introduces
>>>>>>>>>>> ambiguity which is not documented anywhere. It wasn't ambiguous
>>>>>>>>>>> before, because we forced users to specify USING provider if it
>>>>>>>>>>> wasn't a Hive table. Now the result is either the default provider
>>>>>>>>>>> or Hive depending on which options are provided, which seems
>>>>>>>>>>> non-trivial to reason about. (End users would never know, as it's
>>>>>>>>>>> determined entirely by the parser rules.)
>>>>>>>>>>>
>>>>>>>>>>> I see this as an issue of "not breaking old behavior". The parser
>>>>>>>>>>> rules get quite complicated in order to support the legacy config.
>>>>>>>>>>> Never breaking anything will eventually leave us stuck.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>>>>>
>>>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will
>>>>>>>>>>> use the value of spark.sql.sources.default as its provider. In Spark
>>>>>>>>>>> version 2.4 and earlier, it was hive. To restore the behavior 
>>>>>>>>>>> before Spark
>>>>>>>>>>> 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled 
>>>>>>>>>>> to true.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This isn't true if "ROW FORMAT" / "STORED AS" is provided, and we
>>>>>>>>>>> don't describe that anywhere. (See the sketch below.)
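>>>>>>>>>>>
>>>>>>>>>>> For example (the table name is made up):
>>>>>>>>>>>
>>>>>>>>>>>   -- no explicit provider, yet this creates a Hive table regardless
>>>>>>>>>>>   -- of spark.sql.sources.default:
>>>>>>>>>>>   CREATE TABLE t (id INT) STORED AS orc;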
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>>     [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>>     [ USING data_source ]
>>>>>>>>>>>     [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>     [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>>>>>     [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>>>>>         [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>>>>>         INTO num_buckets BUCKETS ]
>>>>>>>>>>>     [ LOCATION path ]
>>>>>>>>>>>     [ COMMENT table_comment ]
>>>>>>>>>>>     [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>     [ AS select_statement ]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>>>>>
>>>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>>>     [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>>>     [ COMMENT table_comment ]
>>>>>>>>>>>     [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>>>>>         | ( col_name1, col_name2, ... ) ]
>>>>>>>>>>>     [ ROW FORMAT row_format ]
>>>>>>>>>>>     [ STORED AS file_format ]
>>>>>>>>>>>     [ LOCATION path ]
>>>>>>>>>>>     [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>>>     [ AS select_statement ]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> At least we should document that the parser tries to match the
>>>>>>>>>>> first case (create table ~ using data source) first, and falls back
>>>>>>>>>>> to the second case; but even if we document this, it's not intuitive
>>>>>>>>>>> to reason about which rule a DDL query will fall into. As I
>>>>>>>>>>> commented earlier, "ROW FORMAT" and "STORED AS" are the clauses that
>>>>>>>>>>> make a DDL query fall into the second case, but they're documented
>>>>>>>>>>> as "optional", so the gotcha is hard to catch.
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, while we document the syntax as above, in reality we
>>>>>>>>>>> allow "EXTERNAL" in the first rule (and then throw an error), which
>>>>>>>>>>> breaks the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION".
>>>>>>>>>>> Even if we add "USING hive", the parser throws an error. It now
>>>>>>>>>>> requires "ROW FORMAT" or "STORED AS".
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Simply put, do we really think end users should stop and match
>>>>>>>>>>> their query against the parser rules (or the rule ordering, once we
>>>>>>>>>>> explain it in the doc) by themselves to understand which provider
>>>>>>>>>>> the table will use? I'm sorry, but I think we are making a bad
>>>>>>>>>>> assumption about end users, which is a serious problem.
>>>>>>>>>>>
>>>>>>>>>>> If we really want to promote Spark's own CREATE TABLE, then would
>>>>>>>>>>> it really matter to treat the Hive CREATE TABLE as the "exceptional"
>>>>>>>>>>> one and isolate the two from each other? What's the point of
>>>>>>>>>>> providing a legacy config to go back to the old behavior if we fear
>>>>>>>>>>> breaking anything to make things better or clearer? If we do think
>>>>>>>>>>> the table provider is important (hence the change was made), is it
>>>>>>>>>>> still a trivial problem that the provider is determined by whether
>>>>>>>>>>> "optional" clauses are specified?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think the general guideline is to promote Spark's own CREATE
>>>>>>>>>>>> TABLE syntax instead of the Hive one. Previously these two rules
>>>>>>>>>>>> were mutually exclusive, because the native syntax required the
>>>>>>>>>>>> USING clause while the Hive syntax makes the ROW FORMAT or STORED
>>>>>>>>>>>> AS clause optional.
>>>>>>>>>>>>
>>>>>>>>>>>> It's a good move to make the USING clause optional, which makes
>>>>>>>>>>>> it easier to write the native CREATE TABLE syntax. Unfortunately, 
>>>>>>>>>>>> it leads
>>>>>>>>>>>> to some conflicts with the Hive CREATE TABLE syntax, but I don't 
>>>>>>>>>>>> see a
>>>>>>>>>>>> serious problem here. If a user just writes CREATE TABLE without 
>>>>>>>>>>>> USING
>>>>>>>>>>>> or ROW FORMAT or STORED AS, does it matter what table we create? 
>>>>>>>>>>>> Internally
>>>>>>>>>>>> the parser rules conflict and we pick the native syntax depending 
>>>>>>>>>>>> on the
>>>>>>>>>>>> rule order. But the user-facing behavior looks fine.
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not
>>>>>>>>>>>> in 3.0. Shall we simply remove EXTERNAL from the native CREATE
>>>>>>>>>>>> TABLE syntax? Then CREATE EXTERNAL TABLE would create a Hive table,
>>>>>>>>>>>> as in 2.4 (see the sketch below).
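>>>>>>>>>>>>
>>>>>>>>>>>> Something like (the table name and path are made up):
>>>>>>>>>>>>
>>>>>>>>>>>>   -- would only match the Hive rule, behaving as in 2.4:
>>>>>>>>>>>>   CREATE EXTERNAL TABLE t (id INT) LOCATION '/tmp/t';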
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to initiate a discussion and hear your voices on
>>>>>>>>>>>>> resolving the ambiguity between the two "create table" parser
>>>>>>>>>>>>> rules introduced by SPARK-30098 [1].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Previously, "create table" parser rules were clearly
>>>>>>>>>>>>> distinguished via "USING provider", which was very intuitive and
>>>>>>>>>>>>> deterministic. Say, DDL query creates "Hive" table unless "USING 
>>>>>>>>>>>>> provider"
>>>>>>>>>>>>> is specified,
>>>>>>>>>>>>> (Please refer the parser rule in branch-2.4 [2])
>>>>>>>>>>>>>
>>>>>>>>>>>>> After SPARK-30098, the "create table" parser rules became
>>>>>>>>>>>>> ambiguous (please refer to the parser rules in branch-3.0 [3]) -
>>>>>>>>>>>>> the only factors differentiating the two rules are "ROW FORMAT"
>>>>>>>>>>>>> and "STORED AS", which are both defined as "optional". The outcome
>>>>>>>>>>>>> now relies on the "order" of the parser rules, which end users
>>>>>>>>>>>>> have no way to reason about, and which is very unintuitive.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>>>>>>>>>> first rule to provide a better error message) brought more
>>>>>>>>>>>>> confusion (I've described the broken existing query in SPARK-30436
>>>>>>>>>>>>> [4]).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Personally, I'd like to see the two rules made mutually
>>>>>>>>>>>>> exclusive, instead of trying to document the difference and
>>>>>>>>>>>>> telling end users to be careful with their queries. I see two ways
>>>>>>>>>>>>> to make the rules mutually exclusive (both sketched below):
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Add an identifier to the create Hive table rule, like `CREATE
>>>>>>>>>>>>> ... "HIVE" TABLE ...`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> pros. This is the simplest way to distinguish the two rules.
>>>>>>>>>>>>> cons. This would force end users to change their queries if they
>>>>>>>>>>>>> intend to create a Hive table. (Given we would also provide a
>>>>>>>>>>>>> legacy option, I feel this is acceptable.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>>>>>
>>>>>>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>>>>>>> cons. Less intuitive, because these clauses have been optional and
>>>>>>>>>>>>> would now become mandatory to fall into the second rule.
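>>>>>>>>>>>>>
>>>>>>>>>>>>> (Sketches; table names and paths are made up, and the "HIVE"
>>>>>>>>>>>>> marker syntax in option 1 is hypothetical:)
>>>>>>>>>>>>>
>>>>>>>>>>>>>   -- option 1: explicit marker selects the Hive rule
>>>>>>>>>>>>>   CREATE HIVE TABLE t (id INT) LOCATION '/tmp/t';
>>>>>>>>>>>>>   -- option 2: ROW FORMAT or STORED AS becomes mandatory for rule 2
>>>>>>>>>>>>>   CREATE TABLE t (id INT) STORED AS orc LOCATION '/tmp/t';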
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE
>>>>>>>>>>>>> TABLE syntax
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>>>>>> 2.
>>>>>>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>>> 3.
>>>>>>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
