Hey everyone,

Progress has been made on PR #23882 <https://github.com/apache/spark/pull/23882>, and it is now in a state where it can be merged into master.
This is what we're doing for now:

1. PySpark will support strings consistently throughout its API.
   * Arguably, string support makes the syntax closer to SQL and Scala, where similar shorthands can be used to specify columns, and the general direction of the PySpark API has been to stay consistent with those other two;
   * This is a small, additive change that will not break anything;
   * The reason support was missing in the first place is that the code that generated these functions was originally designed for aggregators, which all support column names, but it was also used for other functions (e.g. lower, abs) that do not, so the omission appears to have been unintentional.

We are NOT going to:

1. Make any code changes in Scala.
   * This requires first deciding whether string support is desirable there or not.
2. Decide whether or not strings should be supported in the Scala API.
   * This requires a larger discussion, and the above changes are independent of it.
3. Make PySpark support Column objects where it currently only supports strings (e.g. the multi-argument version of drop()).
   * Converting from a Column to a column name is not something the API does right now, so this is a stronger change;
   * This can be considered separately.
4. Do anything with R for now.
   * Anyone is free to take this on, but I have no experience with R.

If you folks agree with this, let us know, so we can move forward with the merge.

Best.

-- André.

From: Reynold Xin <r...@databricks.com>
Date: Monday, 25 February 2019 at 00:49
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: dev <dev@spark.apache.org>, Sean Owen <sro...@gmail.com>, André Mello <asmello...@gmail.com>
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

The challenge with the Scala/Java API in the past has been that, when there are multiple parameters, it leads to an explosion of function overloads.

On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

I hear three topics in this thread:

1. I don't think we should remove string. Column and string can both be "type safe", and I would agree we don't *need* to break API compatibility here.
2. Gaps in the Python API. Extending on #1, we should definitely be consistent and add string as a parameter where it is missing.
3. Scala API for string - hard to say, but it makes sense if for nothing but consistency. Though I can also see the argument for Column-only in Scala: string might be more natural in Python and much less significant in Scala because of the $"foo" notation.

(My 2c)

________________________________
From: Sean Owen <sro...@gmail.com>
Sent: Sunday, February 24, 2019 6:59 AM
To: André Mello
Cc: dev
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3, but this seems like it would unnecessarily break a lot of code. The string arg is more concise in Python, and I can't think of cases where it's particularly ambiguous or confusing; on the contrary, it's more natural coming from SQL. What we do have are inconsistencies and errors in the support of string vs Column, as fixed in the PR.

I was surprised to see that df.select(abs("col")) throws an error while df.select(sqrt("col")) doesn't. I think that's easy to fix on the Python side. Really, I think the question is: do we need to add methods like def abs(String) and more in Scala? That would remain inconsistent even if the PySpark side is fixed.
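For concreteness, here is a minimal PySpark sketch of the call patterns discussed above, as the behavior stood before the PR's fix. The toy DataFrame and column name are invented for illustration, and the exact exception raised by abs comes from the Py4J layer, so it may vary by version:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import abs as abs_, col, max as max_, sqrt

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1.0,), (4.0,), (9.0,)], ["v"])

    df.select(max_(col("v"))).show()  # Column object: always supported
    df.select(max_("v")).show()       # name string: works for aggregators
    df.select(sqrt("v")).show()       # works: Scala has a sqrt(String) overload

    try:
        # Failed before the fix: the string was passed to the JVM as-is,
        # and the Scala side has no abs(String) overload.
        df.select(abs_("v")).show()
    except Exception as e:
        print(type(e).__name__)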
On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:
>
> # Context
>
> This comes from [SPARK-26979], which became PR #23879 and then PR
> #23882. The following reflects all the findings made so far.
>
> # Description
>
> Currently, in the Scala API, some SQL functions have two overloads,
> one taking a string that names the column to be operated on, the
> other taking a proper Column object. This allows for two patterns of
> calling these functions, which is a source of inconsistency and
> generates confusion for new users, since it is hard to predict which
> functions will accept a column name and which will not.
>
> The PySpark API partially solves this problem by internally
> converting the argument to a Column object prior to passing it
> through to the underlying JVM implementation. This allows for a
> consistent use of name literals across the API, except for a few
> violations:
>
> - lower()
> - upper()
> - abs()
> - bitwiseNOT()
> - ltrim()
> - rtrim()
> - trim()
> - ascii()
> - base64()
> - unbase64()
>
> These violations happen because, for a subset of the SQL functions,
> PySpark uses a functional mechanism (`_create_function`) to call the
> underlying JVM equivalent directly by name, thus skipping the
> conversion step. In most cases the column-name pattern still works,
> because the Scala API has its own support for string arguments, but
> the functions listed above are exceptions there as well.
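A simplified sketch of that mechanism, and of the conversion-based fix, assuming the approximate shape of the `_create_function` helper; the actual code in pyspark/sql/functions.py may differ in detail:

    from pyspark import SparkContext
    from pyspark.sql.column import Column

    def _create_function(name, doc=""):
        """Create a DataFrame function that forwards to the JVM function `name`."""
        def _(col):
            sc = SparkContext._active_spark_context
            # A Column is unwrapped to its JVM counterpart, but a plain string
            # is passed through untouched -- so it only works when the Scala
            # side happens to expose a String overload for this function.
            jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
            return Column(jc)
        _.__name__ = name
        _.__doc__ = doc
        return _

    # The additive fix is, roughly, to route the argument through the same
    # normalizer the rest of the API uses (pyspark.sql.column._to_java_column,
    # which accepts either a Column or a column name):
    #     jc = getattr(sc._jvm.functions, name)(_to_java_column(col))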
> My proposal was to solve this problem by adding the string support
> where it was missing in the PySpark API. Since this is a purely
> additive change, it doesn't break existing code. Additionally, I find
> the API sugar to be a positive feature, since code like max("foo") is
> more concise and readable than max(col("foo")). It adheres to the DRY
> philosophy and is consistent with Python's preference for readability
> over type protection.
>
> However, upon submission of the PR, a discussion started about
> whether it wouldn't be better to deprecate string support entirely
> instead - in particular with the major 3.0 release in mind. The
> reasoning, as I understood it, was that this approach is more
> explicit and type safe, which is preferred in Java/Scala, that it
> reduces the API surface area, and that the Python API should be
> consistent with the others.
>
> Upon request by @HyukjinKwon, I'm submitting this matter for
> discussion by this mailing list.
>
> # Summary
>
> There is a problem with inconsistency in the Scala/Python SQL API,
> where sometimes you can use a column name string as a proxy, and
> sometimes you have to use a proper Column object. To solve it there
> are two approaches: remove the string support entirely, or add it
> where it is missing. Which approach is best?
>
> Hope this is clear.
>
> -- André.