Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Wenchen Fan
Big +1 to have one single unified CREATE TABLE syntax.

In general, we can say there are two ways to specify the table provider: the
USING clause and the ROW FORMAT/STORED AS clauses. These two ways are mutually
exclusive. If neither is specified, it implicitly means USING defaultSource.
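
For example (a sketch only; table names are made up, and this reflects the
proposed unified behavior, not necessarily what Spark does today):

    // spark is the usual SparkSession, e.g. in spark-shell.
    spark.sql("CREATE TABLE t1 (id INT) USING parquet")   // explicit provider
    spark.sql("CREATE TABLE t2 (id INT) STORED AS ORC")   // Hive serde table
    spark.sql("CREATE TABLE t3 (id INT)")                 // implicit USING defaultSource
    // Combining the two, e.g. USING parquet ... STORED AS ORC, would be rejected.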

I'm fine with a few special cases that can indicate the table provider,
like EXTERNAL indicating a Hive serde table. A few thoughts:
1. SKEWED BY ...: we support it in the Hive syntax just to fail it with a nice
error message. We can support it in the unified syntax as well, and fail it there too.
2. PARTITIONED BY colTypeList: I think we can support it in the unified
syntax. Just make sure it doesn't appear together with PARTITIONED BY
transformList.
3. OPTIONS: we can either map it to Hive serde properties, or let it
indicate non-Hive tables.
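
To illustrate point 2, a sketch of the two partitioning forms that must not be
mixed (table and column names are made up):

    // Hive-style colTypeList: declares new partition columns, types required.
    spark.sql("CREATE TABLE logs1 (msg STRING) PARTITIONED BY (dt STRING) STORED AS PARQUET")
    // Spark-style transformList: references existing columns, no types allowed.
    spark.sql("CREATE TABLE logs2 (msg STRING, dt STRING) USING parquet PARTITIONED BY (dt)")
    // Using both forms in one statement would fail in the unified syntax.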

On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim 
wrote:

> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
> TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes, it
> may add to the confusion.
>
> Ryan, thanks for the detailed analysis and proposal. That's what I would
> like to see in a discussion thread.
>
> I'm open to solutions which enable end users to specify their intention
> properly - my main concern with SPARK-30098 is that it becomes unclear which
> provider a CREATE TABLE query will use unless USING provider is
> explicitly specified. If the new proposal makes this clear, it should
> be better than the status quo.
>
> Replying inline:
>
> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Side comment: The current docs for CREATE TABLE
>> add to the confusion by describing the Hive-compatible command as "CREATE
>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" is
>> actually part of the syntax.
>>
>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue 
>> wrote:
>>
>>> Jungtaek, it sounds like you consider the two rules to be separate
>>> syntaxes with their own consistency rules. For example, if I am using the
>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>> columns and requires types for those columns; if I’m using the Spark syntax
>>> rule with USING then PARTITIONED BY must reference existing columns and
>>> cannot include types.
>>>
>>> I agree that this is confusing to users! We should fix it, but I don’t
>>> think the right solution is to continue to have two rules with divergent
>>> syntax.
>>>
>>> This is confusing to users because they don’t know anything about
>>> separate parser rules. All the user sees is that sometimes PARTITION BY
>>> requires types and sometimes it doesn’t. Yes, we could add a keyword,
>>> HIVE, to signal that the syntax is borrowed from Hive for that case,
>>> but that actually breaks queries that run in Hive.
>>>
> That might matter less, because SPARK-30098 (and I guess your proposal as
> well) forces end users to add "USING HIVE" to their queries to get the Hive
> provider in any case, even when the query only matches rule 1 (the
> conditional one). Once they decide to create a Hive table, the query might
> have to be changed, or they have to change the default provider, or they have
> to enable the legacy config.
>
>
>> I think the right solution is to unify the two syntaxes. I don’t think
>>> they are so different that it isn’t possible. Here are the differences I
>>> see:
>>>
>>>- Only in Hive:
>>>   - EXTERNAL
>>>   - skewSpec: SKEWED BY ...
>>>   - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>   - createFileFormat: STORED AS ...
>>>- Only in Spark:
>>>   - OPTIONS property list
>>>- Different syntax/interpretation:
>>>   - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>>>
>>> ":" after column name is another one only supported in Hive, though
> that's relatively minor to support it in unified syntax.
>
>>
>>>
>>> For the clauses that are supported in one but not the other, we can add
>>> them to a unified rule as optional clauses. The AST builder would then
>>> validate what makes sense or not (e.g., stored as with using or row format
>>> delimited) and finally pass the remaining data on using the
>>> CreateTableStatement. That statement would be handled like we do for
>>> the Spark rule today, but with extra metadata to pass along. This is also a
>>> step toward being able to put Hive behind the DSv2 API because we’d be able
>>> to pass all of the Hive metadata clauses to the v2 catalog.
>>>
>>> The only difficult part is handling PARTITIONED BY. But in that case,
>>> we can use two different syntaxes from the same CREATE TABLE rule. If
>>> types are included, we use the Hive PARTITIONED BY syntax and convert
>>> 

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Ryan Blue
I have an update to the parser that unifies the CREATE TABLE rules. It took
surprisingly little work to get the parser updated to produce
CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
And the only fields I needed to add to those statements were serde: SerdeInfo
and external: Boolean.
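
For reference, a rough sketch of what those additions could look like
(simplified stand-ins, not the actual Spark source; everything except the
serde and external fields is an assumption for illustration):

    case class SerdeInfo(
        storedAs: Option[String],              // STORED AS format, if any
        serde: Option[String],                 // ROW FORMAT SERDE class, if any
        serdeProperties: Map[String, String])  // serde properties, if any

    case class CreateTableStatement(
        tableName: Seq[String],
        // ... the existing fields (schema, partitioning, provider, options, ...)
        serde: Option[SerdeInfo],              // Hive ROW FORMAT / STORED AS info
        external: Boolean)                     // whether EXTERNAL was specified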

From there, we can use the conversion rules to re-create the same Hive
command for v1 or pass the data as properties for v2. I’ll work on getting
this cleaned up and open a PR hopefully tomorrow.

For the questions about how this gets converted to either a Spark or Hive
create table command, that is really up to analyzer rules and
configuration. With my changes, it is no longer determined by the parser:
the parser just produces a node that includes all of the user options and
Spark decides what to do with that in the analyzer. Also, there's already
an option to convert Hive syntax to a Spark
command, spark.sql.hive.convertCTAS.
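
To sketch the "parser produces, analyzer decides" split (assumed names,
reusing the SerdeInfo sketch above; the real analyzer rules are more involved
than this):

    // The parser only records what the user wrote; this decision runs later.
    def chooseProvider(
        usingClause: Option[String],  // USING clause, if present
        serde: Option[SerdeInfo],     // ROW FORMAT / STORED AS info, if present
        defaultSource: String): String = (usingClause, serde) match {
      case (Some(p), None) => p             // explicit provider wins
      case (None, Some(_)) => "hive"        // Hive serde clauses imply Hive
      case (None, None)    => defaultSource // neither clause: use the default
      case (Some(_), Some(_)) =>            // mutually exclusive, so reject
        throw new IllegalArgumentException(
          "USING cannot be combined with ROW FORMAT/STORED AS")
    }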

rb


Re: Spark-3.0 - performance degradation

2020-03-19 Thread beliefer
I tested it and cannot reproduce the issue.
I built Spark 3.1.0 and Spark 2.3.1.
After many tests, I found that there is little difference between them;
each wins some runs and loses others.
And from the event timeline view, Spark 3.1.0 looks more accurate.






Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Jungtaek Lim
Anything would be OK as long as the CREATE TABLE DDL provides a "clear way" to
know the table provider "before" running the query. Great news that it
doesn't require major rework - looking forward to the PR.

Thanks again for jumping in and sorting this out.

- Jungtaek Lim (HeartSaVioR)


Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Dongjoon Hyun
+1 for Wenchen's suggestion.

I believe that the differences and their effects have been communicated widely
and discussed in many ways, twice.

First, this was shared last December.

"FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax", 2019/12/06

https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

Second (at this time in this thread), this has been discussed according to
the new community rubric.

- https://spark.apache.org/versioning-policy.html (Section:
"Considerations When Breaking APIs")

Thank you all.

Bests,
Dongjoon.

On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan  wrote:

> OK let me put a proposal here:
>
> 1. Permanently ban CHAR for native data source tables, and only keep it
> for Hive compatibility.
> It's OK to forget about padding, like what Snowflake and MySQL have done.
> But it's hard for Spark to require consistent behavior for the CHAR type in
> all data sources. Since the CHAR type is not that useful nowadays, it seems
> OK to just ban it. Another way is to document that the padding of the CHAR
> type is data-source dependent, but it's a bit weird to leave this
> inconsistency in Spark.
>
> 2. Leave VARCHAR unchanged in 3.0
> VARCHAR type is so widely used in databases that it's weird if Spark
> doesn't support it. VARCHAR is exactly the same as Spark's StringType
> when the length limitation is not hit, and I'm fine with temporarily leaving
> this flaw in 3.0; users may hit behavior changes when their string values
> hit the VARCHAR length limitation.
>
> 3. Finalize the VARCHAR behavior in 3.1
> For now I have 2 ideas:
> a) Make VARCHAR(x) a first-class data type. This means Spark data sources
> should support VARCHAR, and CREATE TABLE should fail if a column is VARCHAR
> type and the underlying data source doesn't support it (e.g. JSON/CSV).
> Type cast, type coercion, table insertion, etc. should be updated as well.
> b) Simply document that, the underlying data source may or may not enforce
> the length limitation of VARCHAR(x).
>
> Please let me know if you have different ideas.
>
> Thanks,
> Wenchen
>
> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust 
> wrote:
>
>> What I'd oppose is to just ban char for the native data sources, and do
>>> not have a plan to address this problem systematically.
>>>
>>
>> +1
>>
>>
>>> Just forget about padding, like what Snowflake and MySQL have done.
>>> Document that char(x) is just an alias for string. And then move on. Almost
>>> no work needs to be done...
>>>
>>
>> +1
>>
>>
>
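
To make the CHAR inconsistency discussed above concrete, a small illustration
(the behavior shown is assumed and simplified; exact padding semantics vary by
source and version):

    // Two tables with the same declared CHAR(5) column.
    spark.sql("CREATE TABLE c_hive (c CHAR(5)) STORED AS TEXTFILE")  // Hive serde table
    spark.sql("CREATE TABLE c_native (c CHAR(5)) USING parquet")     // native table
    // A Hive serde table may pad 'ab' to 'ab   ' (length 5) on read, while a
    // native table may treat CHAR(5) as a plain string and return 'ab' as-is.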


Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
You are joking when you said "informed widely and discussed in many ways
twice", right?

This thread doesn't even talk about char/varchar: 
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

(Yes it talked about changing the default data source provider, but that's just 
one of the ways we are exposing this char/varchar issue).




Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Dongjoon Hyun
Technically, I have been suffering with (1) `CREATE TABLE` due to the many
differences for a long time (since 2017). So, I had a wrong assumption about
the implication of "(2) FYI: SPARK-30098 Use default datasource as
provider for CREATE TABLE syntax", Reynold. I admit that. You may not feel
the same way. However, it was a lot to me. Also, switching
`convertMetastoreOrc` at 2.4 was a big change to me, although there is
no difference for Parquet-only users.

Dongjoon.

> References:
> 1. "CHAR implementation?", 2017/09/15
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax", 2019/12/06
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E





Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
I agree it sucks. We started with a decision that might have made sense back
in 2013 (let's use Hive as the default source, and, guess what, pick the slowest
possible serde by default). We have been paying that debt ever since.

Thanks for bringing this thread up though. We don't have a clear solution yet,
but at least it has made a lot of people aware of the issues.
