Re-triggering failed GitHub workflows

2020-03-16 Thread Nicholas Chammas
Is there any way contributors can retrigger a failed GitHub workflow, like
we do with Jenkins? There's supposed to be a "Re-run all checks" button,
but I don't see it.

Do we need INFRA to grant permissions for that, perhaps?

Right now I'm doing it by adding empty commits:

```
git commit --allow-empty -m "re-trigger GitHub tests"
```
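
(The empty commit only re-triggers the checks once it's pushed to the PR branch, of course; `my-pr-branch` below is just a placeholder:)

```
git push origin my-pr-branch
```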

Nick


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value
silently? Then I'm wondering why we reverted the `TRIM` functions.

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta 
wrote:

> Hi,
>
> 100% agree with Reynold.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin  wrote:
>
>> Are we sure "not padding" is "incorrect"?
>>
>> I don't know whether ANSI SQL actually requires padding, but plenty of
>> databases don't actually pad.
>>
>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>> "Snowflake currently deviates from common CHAR semantics in that strings
>> shorter than the maximum length are not space-padded at the end."
>>
>> MySQL:
>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>
>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Reynold.
>>>
>>> Please see the following for the context.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31136
>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax"
>>>
>>> I raised the above issue according to the new rubric, and the banning
>>> was the proposed alternative to reduce the potential issue.
>>>
>>> Please give us your opinion since it's still a PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin  wrote:
>>>
 I don’t understand this change. Wouldn’t this “ban” confuse the hell
 out of both new and old users?

 For old users, their old code that was working for char(3) would now
 stop working.

 For new users, the behavior depends on whether the underlying metastore's
 char(3) is supported but different from ANSI SQL (which is not that big of a
 deal if we explain it), or not supported at all.

 On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Apache Spark has suffered from a known consistency issue in
> `CHAR` type behavior among its usages and configurations. However, the
> evolution direction has been gradually moving toward being consistent
> inside Apache Spark because we don't have `CHAR` officially. The following
> is the summary.
>
> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives the following different
> result.
> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
> Hive behavior.)
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 2.4.0, `STORED AS ORC` became consistent.
> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to
> Hive behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
> became consistent.
> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
> fallback to Hive behavior.)
>
> spark-sql> SELECT a, length(a) FROM t1;
> a 2
> spark-sql> SELECT a, length(a) FROM t2;
> a 2
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
> the following syntax to be safe.
>
> CREATE TABLE t(a CHAR(3));
> https://github.com/apache/spark/pull/27902
>
> This email is sent out to inform you based on the new policy we voted on.
> The recommendation is to always use Apache Spark's native `String` type.
>
> Bests,
> Dongjoon.
>
> References:
> 1. "CHAR implementation?", 2017/09/15
>
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
> TABLE syntax", 2019/12/06
>
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>



Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I haven't spent enough time thinking about it to give a strong opinion, but 
this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently
swapped the two arguments. And trim has also been in quite common use for a
long time.

CHAR is an undocumented data type without clearly defined semantics. It's not 
great that we are changing the value here, but the value is already fucked up. 
It depends on the underlying data source, and random configs that are seemingly 
unrelated (orc) would impact the value.
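
A minimal illustration of that config-dependence, reusing the ORC-backed table `t2` from the summary quoted earlier in the thread (results as reported there for Spark 2.4.x):

```
-- Default since 2.4: Spark's native ORC reader, no padding.
spark-sql> SET spark.sql.hive.convertMetastoreOrc=true;
spark-sql> SELECT a, length(a) FROM t2;
a 2
-- Fall back to the Hive serde: the value comes back space-padded.
spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
spark-sql> SELECT a, length(a) FROM t2;
a   3
```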

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there,

I’m kind of new around here, but I have had experience with all of the so-called
“big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server, as well
as PostgreSQL.

They all support the notion of “ANSI padding” for CHAR columns - which means 
that such columns are always space padded, and they default to having this 
enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical 
reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and 
then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to
VARCHAR - I believe that was its purpose in the first place, and it does not
dictate any sort of “padding”.

I can see why you might “ban” the use of CHAR columns where they cannot be 
consistently supported, but VARCHAR is a different animal and I would expect it 
to work consistently everywhere.
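
For readers less familiar with it, here is a minimal sketch of that ANSI-padding behavior in generic SQL (the table and column names are illustrative, and the exact length/comparison semantics vary by engine):

```
CREATE TABLE pad_demo (c CHAR(5), v VARCHAR(5));
INSERT INTO pad_demo VALUES ('ab', 'ab');

-- With ANSI padding enabled, c is stored and returned as 'ab   ' (blank-padded
-- to 5 characters), while v keeps exactly the two characters that were inserted.
SELECT c, v FROM pad_demo;
```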


Cheers,

Steve C


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Thank you, Stephen and Reynold.

To Reynold.

The way I see the following is a little different.

  > CHAR is an undocumented data type without clearly defined semantics.

Let me describe it from an Apache Spark user's viewpoint.

Apache Spark started to provide `HiveContext` (and the `hql/hiveql` functions) at
Apache Spark 1.x without much documentation. In addition, there still
exists an effort to keep it alive in the 3.0.0 era.

   https://issues.apache.org/jira/browse/SPARK-31088
   Add back HiveContext and createExternalTable

Historically, we tried to make many SQL-based customers migrate their
workloads from Apache Hive into Apache Spark through `HiveContext`.

Although Apache Spark didn't have good documentation about the inconsistent
behavior among its data sources, Apache Hive has been providing its
documentation and many customers rely on that behavior.

  -
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

At that time, in on-prem Hadoop clusters from well-known vendors, many
existing huge tables had frequently been created by Apache Hive, not Apache
Spark, and Apache Spark was used to boost SQL performance with its *caching*.
This was true because Apache Spark was added into the Hadoop-vendor
products later than Apache Hive.

Until the turning point at Apache Spark 2.0, we tried to catch up on more
features so as to be consistent at least for Hive tables, because the two SQL
engines (Apache Hive and Apache Spark) share the same tables.

As for the following, technically, while Apache Hive hasn't changed its
existing behavior in this area, Apache Spark has inevitably evolved by moving
away from its original old behaviors one by one.

  >  the value is already fucked up

The following is the change log.

  - When we switched the default value of `convertMetastoreParquet` (at
Apache Spark 1.2).
  - When we switched the default value of `convertMetastoreOrc` (at
Apache Spark 2.4).
  - When we switched `CREATE TABLE` itself (change `TEXT` table to a
`PARQUET` table at Apache Spark 3.0).

To sum up, this has been a well-known issue in the community and among the
customers.
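
For reference, the fallback settings tied to each of those turning points (the config names quoted earlier in this thread) can be applied per session like this:

```
-- Hive behavior for Parquet-backed Hive tables (default flipped at 1.2)
spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
-- Hive behavior for ORC-backed Hive tables (default flipped at 2.4)
spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
-- Keep `CREATE TABLE` creating Hive tables, as before 3.0.0-preview2
spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
```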

Bests,
Dongjoon.


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I looked up our usage logs (sorry I can't share this publicly) and trim has at 
least four orders of magnitude higher usage than char.


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was 
merely pointing out that if we deviate away from SQL standard in any way we are 
considered "wrong" or "incorrect". That argument itself is flawed when plenty 
of other popular database systems also deviate away from the standard on this 
specific behavior.


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Ur, are you comparing the number of SELECT statements with TRIM to the number
of CREATE statements with `CHAR`?

> I looked up our usage logs (sorry I can't share this publicly) and trim
has at least four orders of magnitude higher usage than char.

We need to discuss more about what to do. This thread is what I expected
exactly. :)

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
I was merely pointing out that if we deviate away from SQL standard in any
way we are considered "wrong" or "incorrect". That argument itself is
flawed when plenty of other popular database systems also deviate away from
the standard on this specific behavior.

Bests,
Dongjoon.


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
−User

char barely showed up (honestly negligible). I was comparing select vs select.


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Thank you for sharing and confirming.

We had better consider all the heterogeneous customers in the world. And I
also have experience with non-negligible cases on-prem.

Bests,
Dongjoon.

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
For sure.

There's another reason I feel char is not that important and it's more
important to be internally consistent (e.g. all data sources support it with
the same behavior, vs one data source doing one thing and another doing the
other). char was created at a time when cpu was slow and storage was expensive,
and being able to pack things nicely at fixed length was highly useful. The
fact that it was padded was initially done for performance, not for the padding
itself. A lot has changed since char was invented, and with modern technologies
(columnar, dictionary encoding, etc.) there is little reason to use a char data
type for anything. As a matter of fact, Spark internally converts char type to
string to work with.

I see two solutions really.

1. We require padding, and ban all uses of char when it is not properly padded.
This would ban all the native data sources, which are the primary way people
are using Spark. This leaves char support only for tables going through Hive
serdes, which are slow to begin with. It is basically Dongjoon and Wenchen's
suggestion. This turns char support into a compatibility feature only for some
Hive tables that cannot be converted into Spark native data sources. This has
confusing end-user behavior because depending on whether that Hive table is
converted into Spark native data sources, we might or might not support char
type.

An extension to the above is to introduce padding for char type across the
board, and make char type a first-class data type. There is a lot of work to
introduce another data type, especially for one that has virtually no usage (
https://trends.google.com/trends/explore?geo=US&q=hive%20char,hive%20string )
and whose usage will likely continue to decline in the future (just reasoning
from first principles based on the reason char was introduced in the first place).

Now I'm assuming it's a lot of work to do char properly. But if it is not the 
case (e.g. just a simple rule to insert padding at planning time), then maybe 
it's worth doing it this way. I'm totally OK with this too.
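
If it really were just a planning-time rule, conceptually it amounts to wrapping CHAR values with padding on write (and/or read). An illustrative manual sketch, using a hypothetical `t4` defined like `t3` in the earlier summary:

```
-- Illustrative only: emulating ANSI CHAR(3) padding by hand on a native data source.
spark-sql> CREATE TABLE t4(a CHAR(3)) STORED AS PARQUET;
spark-sql> INSERT INTO TABLE t4 SELECT rpad('a ', 3, ' ');
spark-sql> SELECT a, length(a) FROM t4;
a   3
```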

What I'd oppose is to just ban char for the native data sources, and not
have a plan to address this problem systematically.

2. Just forget about padding, like what Snowflake and MySQL have done. Document 
that char(x) is just an alias for string. And then move on. Almost no work 
needs to be done...


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
I don’t think I can recall any usages of type CHAR in any situation.

Really, its only use (on any traditional SQL database) would be when you
*want* a fixed-width character column that has been right-padded with spaces.



Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-16 Thread wuyi
Thanks Sean and Takeshi.

Option 1 seems really impossible. And I'm going to take Option 2 as an
alternative choice.






Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-16 Thread Wenchen Fan
I don't think option 1 is possible.

For option 2: I think we need to do it anyway. It's kind of a bug that the
typed Scala UDF doesn't support case classes and thus can't support
struct-type input columns.

For option 3: It's a bit risky to add a new API, but it seems like we have a
good reason. The untyped Scala UDF supports Row as input/output, which is a
valid use case to me. It requires a "returnType" parameter, but not input
types. This brings two problems: 1) if the UDF parameter is primitive-type
but the actual value is null, the result will be wrong; 2) the Analyzer can't
do type checks for the UDF.

Maybe we can add a new method, `def udf(f: AnyRef, inputTypes: Seq[(DataType,
Boolean)], returnType: DataType)`, to allow users to specify the expected
input data types and nullabilities.
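
As a rough sketch of problem 1) above, assuming a Spark 2.4.x local session (where the untyped `udf(f, returnType)` overload is still allowed; the object and column names are illustrative):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

object UntypedUdfNullSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("untyped-udf-null").getOrCreate()
    import spark.implicits._

    // Untyped: only the return type is known, so a null input bound to the
    // primitive Int parameter is read as 0 and the result is silently wrong.
    val plusOneUntyped = udf((x: Int) => x + 1, IntegerType)
    // Typed: input types are known, so Spark can null-check the primitive argument.
    val plusOneTyped = udf((x: Int) => x + 1)

    val df = Seq[Integer](1, null).toDF("i")
    df.select(plusOneUntyped($"i").as("untyped"), plusOneTyped($"i").as("typed")).show()

    spark.stop()
  }
}
```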

On Tue, Mar 17, 2020 at 1:58 PM wuyi  wrote:

> Thanks Sean and Takeshi.
>
> Option 1 seems really impossible. And I'm going to take Option 2 as an
> alternative choice.