Sounds good. I'll plan on adding a PR with Hive's CHANGE syntax in addition to what I've proposed here.
I have all of these working in our Spark distribution, so I'm just waiting on finalizing the TableCatalog API to submit these upstream.

On Wed, Oct 3, 2018 at 10:07 PM Wenchen Fan <cloud0...@gmail.com> wrote:

Thank you Ryan for proposing the DDL syntax! I think it's good to follow mainstream databases, and the proposed syntax looks very reasonable.

About Hive compatibility, I think it's not that important now, but it's still good if we keep it. Shall we support the Hive syntax as an alternative? It seems not very hard, just a few more ANTLR rules. It would be even better if we could make ANTLR extensible and allow other data sources to define custom SQL syntax.

Anyway, I think they are orthogonal. We can go ahead with the syntax proposed here and add Hive-compatible syntax later.

On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

I'd say that it was important to be compatible with Hive in the past, but that's becoming less important over time. Spark is well established with Hadoop users, and I think the focus moving forward should be to make Spark more predictable as a SQL engine for people coming from more traditional databases.

That said, I think there is no problem supporting the ALTER syntax for both Hive/MySQL and the more standard versions.

On Tue, Oct 2, 2018 at 8:35 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

I think it has been an important "selling point" that Spark is "mostly compatible" with Hive DDL.

I have seen a lot of teams suffer from switching between the Presto and Hive dialects.

So one question I have is: are we at the point of switching from Hive-compatible syntax to, say, ANSI SQL?

Perhaps a more critical question: what would it take to get the platform to support both, by making the ANTLR grammar extensible?

From: Alessandro Solimando <alessandro.solima...@gmail.com>
Sent: Tuesday, October 2, 2018 12:35 AM
To: rb...@netflix.com
Cc: Xiao Li; dev
Subject: Re: [DISCUSS] Syntax for table DDL

I agree with Ryan: a "standard" and more widely adopted syntax is usually a good idea, with possibly some slight improvements like "bulk deletion" of columns (especially because both the syntax and the semantics are clear), rather than staying with Hive syntax at any cost.

I am personally following this PR with a lot of interest. Thanks for all the work along this direction.

Best regards,
Alessandro

On Mon, 1 Oct 2018 at 20:21, Ryan Blue <rb...@netflix.com.invalid> wrote:

What do you mean by consistent with the syntax in SqlBase.g4? These statements aren't currently defined, so we need to decide what syntax to support. There are more details below, but the syntax I'm proposing is more standard across databases than Hive, which uses confusing and non-standard syntax.

I doubt that we want to support the Hive syntax, for a few reasons. Hive uses the same column CHANGE statement for multiple purposes, so it ends up with strange patterns for simple tasks, like updating a column's type:

    ALTER TABLE t CHANGE a1 a1 INT;

The column name is doubled because the old name, the new name, and the type are always required. So you have to know a column's type to change its name, and you have to double up the name to change its type.
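For illustration, a sketch of the reverse case, assuming a1 is a STRING column being renamed to b1 (the type and the new name here are illustrative):

    -- Rename a1 to b1 without changing its type: the existing STRING type must still be restated
    ALTER TABLE t CHANGE a1 b1 STRING;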
Hive also allows a couple of other oddities:

- Column reordering with the FIRST and AFTER keywords. Column reordering is tricky to get right, so I'm not sure we want to add it.
- RESTRICT and CASCADE to signal whether to change all partitions or not. Spark doesn't support partition-level schemas except through Hive, and even then I'm not sure how reliable it is.

I know that we wouldn't necessarily have to support these features from Hive, but I'm pointing them out to ask the question: why copy Hive's syntax if it is unlikely that Spark will implement all of the "features"? I'd rather go with SQL syntax from databases like PostgreSQL or others that are more standard and common.

The more "standard" versions of these statements are like what I've proposed:

- ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and Oracle. COLUMN is optional in Oracle, and TYPE is omitted by databases other than PostgreSQL. I think we could easily add MODIFY as an alternative to the second ALTER (and maybe alternatives like UPDATE and CHANGE) and make both TYPE and COLUMN optional.
- ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the same syntax as Hive, and it appears that SQL Server doesn't have this statement. This also matches the table rename syntax, which uses TO.
- ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN optional. Most don't allow deleting multiple columns, but it's a reasonable extension.

While we're on the subject of ALTER TABLE DDL, I should note that all of the databases use ADD COLUMN syntax that differs from Hive (and, currently, Spark):

- ALTER TABLE ident ADD COLUMN qualifiedName dataType (',' qualifiedName dataType)*: All other databases I looked at use ADD COLUMN, but not all of them support adding multiple columns at the same time. Hive requires ( and ) enclosing the columns and uses the COLUMNS keyword instead of COLUMN. I think that Spark should be updated to make the parens optional and to support both keywords, COLUMN and COLUMNS.

What does everyone think? Is it reasonable to use the more standard syntax instead of using Hive as a base?

rb

On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <gatorsm...@gmail.com> wrote:

Are they consistent with the current syntax defined in SqlBase.g4? I think we are following the Hive DDL syntax:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column

On Fri, Sep 28, 2018 at 3:47 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Hi everyone,

I'm currently working on new table DDL statements for v2 tables. For context, the new logical plans for DataSourceV2 require a catalog interface so that Spark can create tables for operations like CTAS. The proposed TableCatalog API also includes an API for altering those tables, so we can make ALTER TABLE statements work. I'm implementing those DDL statements, which will land in upstream Spark when the TableCatalog PR is merged.
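As a rough sketch of the two kinds of operations that paragraph refers to, using illustrative table names (the ADD COLUMNS form below is the Hive-style syntax Spark parses today):

    -- CTAS: the catalog has to be able to create a table from a query result
    CREATE TABLE db.events_copy AS SELECT * FROM db.events;
    -- ALTER TABLE: the catalog has to be able to apply changes to an existing table
    ALTER TABLE db.events_copy ADD COLUMNS (ts TIMESTAMP);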
Since I'm adding new SQL statements that don't yet exist in Spark, I want to make sure that the syntax I'm using in our branch will match the syntax we add to Spark later. I'm basing this proposed syntax on PostgreSQL <https://www.postgresql.org/docs/current/static/ddl-alter.html>:

- Update data type: ALTER TABLE tableIdentifier ALTER COLUMN qualifiedName TYPE dataType
- Rename column: ALTER TABLE tableIdentifier RENAME COLUMN qualifiedName TO qualifiedName
- Drop column: ALTER TABLE tableIdentifier DROP (COLUMN | COLUMNS) qualifiedNameList

A few notes:

- Using qualifiedName in these rules allows updating nested types, like point.x.
- Updates and renames can only alter one column, but drop can drop a list.
- Rename can't move types and will validate that, if the TO name is qualified, the prefix matches the original field.
- I'm also changing ADD COLUMN to support adding fields to nested columns by using qualifiedName instead of identifier.

Please reply to this thread if you have suggestions based on a different SQL engine or want this syntax to be different for another reason. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix
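Pulling the proposed forms together, a short sketch of the statements applied to top-level and nested columns (the table name, column names, and types are illustrative; point.x is the nested field named in the notes above):

    -- Update a column's data type
    ALTER TABLE db.events ALTER COLUMN point.x TYPE BIGINT;
    -- Rename a nested field; the TO name keeps the original point prefix
    ALTER TABLE db.events RENAME COLUMN point.x TO point.lon;
    -- Drop one or more columns in a single statement
    ALTER TABLE db.events DROP COLUMNS point.y, notes;
    -- Add a field to a nested column using a qualified name
    ALTER TABLE db.events ADD COLUMN point.z DOUBLE;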