Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-17 Thread Hyukjin Kwon
Option 2 seems fine to me.

On Tue, Mar 17, 2020 at 3:41 PM, Wenchen Fan wrote:

> I don't think option 1 is possible.
>
> For option 2: I think we need to do it anyway. It's kind of a bug that the
> typed Scala UDF doesn't support case classes and thus can't support
> struct-type input columns.
>
> For option 3: It's a bit risky to add a new API but seems like we have a
> good reason. The untyped Scala UDF supports Row as input/output, which is a
> valid use case to me. It requires a "returnType" parameter, but not input
> types. This brings two problems: 1) if a UDF parameter is of primitive type
> but the actual value is null, the result will be wrong; 2) the analyzer can't
> type-check the UDF's inputs.
>
> Maybe we can add a new method `def udf(f: AnyRef, inputTypes:
> Seq[(DataType, Boolean)], returnType: DataType)` to allow users to
> specify the expected input data types and nullabilities.
>
> On Tue, Mar 17, 2020 at 1:58 PM wuyi  wrote:
>
>> Thanks Sean and Takeshi.
>>
>> Option 1 seems really impossible. And I'm going to take Option 2 as an
>> alternative choice.
>>
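To make the problem concrete, here is a minimal sketch of the primitive-type null issue described above, using the existing untyped udf(f: AnyRef, dataType: DataType) variant. The data and names are made up for illustration, and the sketch simply restates the behavior Wenchen describes:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.IntegerType

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Untyped variant: only the return type is given; input types (and their
    // nullabilities) are unknown to Spark.
    val plusOne = udf((i: Int) => i + 1, IntegerType)

    val df = Seq[Integer](1, null).toDF("i")

    // With the input type unknown, a null `i` may be read back as the
    // primitive default 0, so the second row can silently become 1 instead
    // of null, the "wrong result" problem described above.
    df.select(plusOne(col("i"))).show()

Under the proposed udf(f, inputTypes, returnType) overload, a caller could pass Seq((IntegerType, false)) as inputTypes, so Spark would know the parameter is a non-nullable primitive and could handle null inputs correctly instead of producing a wrong result.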


Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
I agree that Spark can define the semantics of CHAR(x) differently from the
SQL standard (no padding), and ask the data sources to follow it. But the
problem is that some data sources may not be able to skip padding, like the
Hive serde table.

On the other hand, it's easier to require padding for CHAR(x). Even if some
data sources don't support padding, Spark can simply do the padding at
read time, using the `rpad` function. However, if CHAR(x) is rarely used,
maybe we should just ban it and only keep it for Hive compatibility, to
save ourselves the work.
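As a rough sketch of the read-time padding idea (not an existing Spark rule), the padding could be applied with the `rpad` function. The table and column names below are hypothetical, and a SparkSession named spark is assumed:

    import org.apache.spark.sql.functions.{col, rpad}

    // Suppose column `c` was declared CHAR(5) by the source but the data
    // source stores it unpadded; right-padding at read time restores the
    // fixed-width semantics.
    val padded = spark.table("t").withColumn("c", rpad(col("c"), 5, " "))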

VARCHAR(x) is a different story as it's a commonly used data type in
databases. It has a length limit which can help the backing engine
make better decisions when dealing with it. Currently Spark just treats
VARCHAR(x) as string type, which works fine in most cases, but different
data sources may behave differently when writing. For example, the
pgsql JDBC data source fails the write if the length limit is exceeded, the Hive
serde table simply truncates the characters exceeding the length limit, and the Parquet
data source writes whatever string it gets.

We can just document that the underlying data source may or may not
enforce the length limit of VARCHAR(x). Or we can make VARCHAR(x) a
first-class data type, which requires a lot more changes (type coercion,
type casts, etc.).

Before we make a final decision, I think it's reasonable to ban CHAR/VARCHAR
in non-Hive-serde tables in 3.0, so that we don't introduce silent result
changes here.

Any ideas are welcome!

Thanks,
Wenchen

On Tue, Mar 17, 2020 at 11:29 AM Stephen Coy 
wrote:

> I don’t think I can recall any usages of type CHAR in any situation.
>
> Really, its only use (on any traditional SQL database) would be when you
> *want* a fixed-width character column that has been right-padded with
> spaces.
>
>
> On 17 Mar 2020, at 12:13 pm, Reynold Xin  wrote:
>
> For sure.
>
> There's another reason I feel char is not that important and it's more
> important to be internally consistent (e.g. all data sources support it
> with the same behavior, vs. one data source exhibiting one behavior and
> another a different one). char was created at a time when CPU was slow and
> storage was expensive, and being able to pack things nicely at fixed length
> was highly useful. The fact that it was padded was initially done for
> performance, not for the padding itself. A lot has changed since char was
> invented, and with modern technologies (columnar, dictionary encoding, etc.)
> there is little reason to use a char data type for anything. As a matter of
> fact, Spark internally converts the char type to string to work with it.
>
>
> I see two solutions really.
>
> 1. We require padding, and ban all uses of char when it is not properly
> padded. This would ban all the native data sources, which are the primary
> way people use Spark. This leaves char support only for tables going
> through Hive serdes, which are slow to begin with. It is basically Dongjoon
> and Wenchen's suggestion. This turns char support into a compatibility
> feature only for some Hive tables that cannot be converted into Spark
> native data sources. This makes for confusing end-user behavior because,
> depending on whether that Hive table is converted into Spark native data
> sources, we might or might not support the char type.
>
> An extension to the above is to introduce padding for char type across the
> board, and make char type a first-class data type. There is a lot of work
> involved in introducing another data type, especially for one that has
> virtually no usage, and whose usage will likely continue to decline in the
> future (just reasoning from first principles based on why char was
> introduced in the first place).
>
> Now I'm assuming it's a lot of work to do char properly. But if it is not
> the case (e.g. just a simple rule to insert padding at planning time), then
> maybe it's worth doing it this way. I'm totally OK with this too.
>
> What I'd oppose is to just ban char for the native data sources, and do
> not have a plan to address this problem systematically.
>
>
> 2. Just forget about padding, like what Snowflake and MySQL have done.
> Document that char(x) is just an alias for string. And then move on. Almost
> no work needs to be done...
>
>
>
>
>
>
>
> On Mon, Mar 16, 2020 at 5:54 PM, Dongjoon Hyun 
> wrote:
>
>> Thank you for sharing and confirming.
>>
>> We had better consider all the heterogeneous customers in the world. And I
>> also have experience with non-negligible cases on-prem.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin  wrote:
>>
>>> −User
>>>
>>> char barely showed up (honestly negligible). I was comparing select vs
>>> select.
>>>
>>>
>>>
>>> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun 
>>> wrote:
>>>
 Ur, are you comparing the number of SELECT statements with TRIM and
 CREATE statements with `CHAR`?

 > I 

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Maryann Xue
It would be super weird for a SQL engine not to support VARCHAR. Banning CHAR
is probably fine, as its semantics are genuinely confusing.
We can issue a warning when parsing VARCHAR with a limit and suggest
using String instead.


Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-17 Thread Ben Roling
I tried this on the users mailing list but didn't get traction.  It's
probably more appropriate here anyway.

I've noticed that Dataset.sqlContext is public in Scala but the equivalent
(DataFrame._sc) in PySpark is named as if it should be treated as private.

Is this intentional?  If so, what's the rationale?  If not, then it feels
like a bug and DataFrame should have some form of public access back to the
context/session.  I'm happy to log the bug but thought I would ask here
first.  Thanks!
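For context, a small sketch of the Scala side being referred to; the helper below is hypothetical and only shows the accessors that are public on Dataset/DataFrame in Scala:

    import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}

    // Both accessors below are public in the Scala API.
    def contextsOf(df: DataFrame): (SQLContext, SparkSession) =
      (df.sqlContext, df.sparkSession)

    // PySpark exposes no equivalent public accessor; the underscore-prefixed
    // attribute mentioned above is conventionally treated as private.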


Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
>
> What I'd oppose is to just ban char for the native data sources, and do
> not have a plan to address this problem systematically.
>

+1


> Just forget about padding, like what Snowflake and MySQL have done.
> Document that char(x) is just an alias for string. And then move on. Almost
> no work needs to be done...
>

+1


Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
OK let me put a proposal here:

1. Permanently ban CHAR for native data source tables, and only keep it for
Hive compatibility.
It's OK to forget about padding like what Snowflake and MySQL have done.
But it's hard for Spark to require consistent behavior for the CHAR type
across all data sources. Since the CHAR type is not that useful nowadays, it
seems OK to just ban it. Another way is to document that the padding of the
CHAR type is data-source dependent, but it's a bit weird to leave this
inconsistency in Spark.

2. Leave VARCHAR unchanged in 3.0
The VARCHAR type is so widely used in databases that it would be weird if
Spark didn't support it. VARCHAR(x) is exactly the same as Spark's StringType
as long as the length limit is not hit, and I'm fine with temporarily leaving
this flaw in 3.0; users may hit behavior changes when string values exceed the
VARCHAR length limit (see the sketch after this list).

3. Finalize the VARCHAR behavior in 3.1
For now I have two ideas:
a) Make VARCHAR(x) a first-class data type. This means Spark data sources
should support VARCHAR, and CREATE TABLE should fail if a column is of VARCHAR
type and the underlying data source doesn't support it (e.g. JSON/CSV).
Type casts, type coercion, table insertion, etc. should be updated as well.
b) Simply document that the underlying data source may or may not enforce
the length limit of VARCHAR(x).
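As a rough illustration of the current behavior described in point 2 (based on the statement above that Spark treats VARCHAR(x) as plain string; the table name is hypothetical and a SparkSession named spark is assumed):

    // VARCHAR(5) in the DDL currently comes back as plain StringType, so the
    // declared length limit is not tracked by Spark itself.
    spark.sql("CREATE TABLE t (c VARCHAR(5)) USING parquet")
    spark.table("t").printSchema()
    // root
    //  |-- c: string (nullable = true)

    // Whether a value longer than 5 characters is rejected, truncated, or
    // written as-is then depends on the underlying data source.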

Please let me know if you have different ideas.

Thanks,
Wenchen

On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust 
wrote:

> What I'd oppose is to just ban char for the native data sources, and do
>> not have a plan to address this problem systematically.
>>
>
> +1
>
>
>> Just forget about padding, like what Snowflake and MySQL have done.
>> Document that char(x) is just an alias for string. And then move on. Almost
>> no work needs to be done...
>>
>
> +1
>
>