Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-16 Thread Takeshi Yamamuro
The revert looks fine to me for keeping compatibility.
Also, I think the different parameter orders between the systems easily
lead to mistakes, so, as Wenchen suggested, it would be good to deprecate
these two-parameter trim functions in future releases:
https://github.com/apache/spark/pull/27540#discussion_r377682518
Actually, I think the SQL TRIM syntax is enough for trim use cases...

Bests,
Takeshi


On Sun, Feb 16, 2020 at 3:02 AM Maxim Gekk 
wrote:

> Also if we look at possible combination of trim parameters:
> 1. foldable srcStr + foldable trimStr
> 2. foldable srcStr + non-foldable trimStr
> 3. non-foldable srcStr + foldable trimStr
> 4. non-foldable srcStr + non-foldable trimStr
>
> Case #2 seems rare, and #3 is probably the most common one. Once we see
> the second case, we could output a warning about the likely mixed-up
> order of parameters.
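>
> For illustration, a sketch of the four cases in Spark SQL, assuming a
> hypothetical table t(src STRING, chars STRING) and the Spark 2.4
> signature trim(trimStr, srcStr):
>
> -- t is a hypothetical table; first argument is trimStr in 2.4's order
> SELECT trim('xyz', 'yxTomxx');           -- 1. both foldable
> SELECT trim(t.chars, 'yxTomxx') FROM t;  -- 2. foldable srcStr + non-foldable trimStr (rare)
> SELECT trim('xyz', t.src) FROM t;        -- 3. non-foldable srcStr + foldable trimStr (common)
> SELECT trim(t.chars, t.src) FROM t;      -- 4. both non-foldable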
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Sat, Feb 15, 2020 at 5:04 PM Wenchen Fan  wrote:
>
>> It's unfortunate that we don't have a clear document about breaking
>> changes (I'm working on it, BTW). I believe the general guidance is:
>> *avoid breaking changes unless we have to*. For example, a previous
>> result was so broken that we had to fix it, moving to Scala 2.12 forced
>> us to break some APIs, etc.
>>
>> For this particular case, do we have to switch the parameter order? It's
>> different from some systems, and the order was never decided explicitly,
>> but I don't think those are strong reasons. People coming from an RDBMS
>> will more often use the SQL standard TRIM syntax. People using prior
>> Spark versions should already have figured out the parameter order of
>> Spark TRIM (there was no documentation) and adjusted their queries. And
>> there is no standard that defines the parameter order of the TRIM
>> function.
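>>
>> As a sketch of the standard forms (assuming the ANSI TRIM semantics;
>> Spark's support comes from SPARK-28126), the syntax makes the roles
>> explicit:
>>
>> SELECT trim(BOTH 'xy' FROM 'yxTomxx');     -- expected: Tom
>> SELECT trim(LEADING 'xy' FROM 'yxTomxx');  -- expected: Tomxx
>> SELECT trim(TRAILING 'xy' FROM 'yxTomxx'); -- expected: yxTom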
>>
>> In the long term, we would also promote the SQL standard TRIM syntax. I
>> don't see enough benefit in "fixing" the parameter order to be worth a
>> breaking change.
>>
>> Thanks,
>> Wenchen
>>
>>
>>
>> On Sat, Feb 15, 2020 at 3:44 AM Dongjoon Hyun 
>> wrote:
>>
>>> Please note that the context is TRIM/LTRIM/RTRIM with two parameters and
>>> the TRIM(trimStr FROM str) syntax.
>>>
>>> This thread is irrelevant to one-parameter TRIM/LTRIM/RTRIM.
>>>
>>> On Fri, Feb 14, 2020 at 11:35 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 I'm sending this email because the Apache Spark committers should have a
 consistent point of view for the upcoming PRs. And a community policy is
 the way to lead the community members transparently and clearly for the
 long-term good.

 First of all, I want to emphasize that, like Apache Spark 2.0.0, Apache
 Spark 3.0.0 is going to achieve many improvements in the SQL area.

 In particular, we invested a lot of effort to improve ANSI SQL compliance
 and the related test coverage for the future. One simple example is the
 following.

 [SPARK-28126][SQL] Support TRIM(trimStr FROM str) syntax

 With this support, we are able to move away from the esoteric Spark
 functions `TRIM/LTRIM/RTRIM`.
 (Note that the above syntax is the ANSI standard, but we additionally have
 these non-standard functions, too.)

 Here is the summary of the current status.

 +------------------------+-------------+---------------+
 | SQL Processing Engine  | TRIM Syntax | TRIM Function |
 +------------------------+-------------+---------------+
 | Spark 3.0.0-preview/2  |      O      |       O       |
 | PostgreSQL             |      O      |       O       |
 | Presto                 |      O      |       O       |
 | MySQL                  |     O(*)    |       X       |
 | Oracle                 |      O      |       X       |
 | MsSQL                  |      O      |       X       |
 | Teradata               |      O      |       X       |
 | Hive                   |      X      |       X       |
 | Spark 1.6 ~ 2.2        |      X      |       X       |
 | Spark 2.3 ~ 2.4        |      X      |     O(**)     |
 +------------------------+-------------+---------------+
 (*) MySQL doesn't allow multiple trim characters.
 (**) Spark 2.3 ~ 2.4 have the function with a different parameter order.

 Here is an illustrative example of the problem.

 postgres=# SELECT trim('yxTomxx', 'xyz');
 btrim
 ---
 Tom

 presto:default> SELECT trim('yxTomxx', 'xyz');
 _col0
 ---
 Tom

 spark-sql> SELECT trim('yxTomxx', 'xyz');
 z
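
 For contrast, a sketch of the unambiguous ANSI syntax in spark-sql
 (assuming a build with the SPARK-28126 support):

 spark-sql> SELECT trim('xyz' FROM 'yxTomxx');
 Tom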

 Here is our history of fixing the above issue.

 [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue
 1. https://github.com/apache/spark/pull/24902
    (Merged 2019-06-18 for v3.0.0; released in 3.0.0-preview and
    3.0.0-preview2.)

Re: [DOCS] Spark SQL Upgrading Guide

2020-02-16 Thread Jacek Laskowski
Hi,

Never mind. Found this [1]:

> This config is deprecated and it will be removed in 3.0.0.

And so it has :) Thanks and sorry for the trouble.

[1]
https://github.com/apache/spark/blob/830a4ec59b86253f18eb7dfd6ed0bbe0d7920e5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1306-L1307

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Sat, Feb 15, 2020 at 7:44 PM Jacek Laskowski  wrote:

> Hi,
>
> Just noticed that
> http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html (Spark
> 2.4.5) has formatting issues in "Upgrading from Spark SQL 2.4.3 to 2.4.4"
> [1], which were fixed in master [2]. That's OK.
>
> What made me wonder was the other change: the section "Upgrading from
> Spark SQL 2.4 to 2.4.5" [3] had the following item included:
>
> "Starting from 2.4.5, SQL configurations are effective also when a Dataset
> is converted to an RDD and its plan is executed due to action on the
> derived RDD. The previous behavior can be restored setting
> spark.sql.legacy.rdd.applyConf to false: in this case, SQL configurations
> are ignored for operations performed on a RDD derived from a Dataset."
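>
> As a side note, a minimal sketch of restoring the previous behavior in a
> 2.4.5 session (assuming the config is still present there; it was removed
> in 3.0.0):
>
> spark-sql> SET spark.sql.legacy.rdd.applyConf=false;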
>
> Why was this removed in master [4]? It was mentioned in "Notable changes"
> of Spark Release 2.4.5 [5].
>
> [1]
> http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-243-to-244
> [2]
> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-243-to-244
> [3]
> http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-24-to-245
> [4]
> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-244-to-245
> [5] http://spark.apache.org/releases/spark-release-2-4-5.html
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>


Re: [DOCS] Spark SQL Upgrading Guide

2020-02-16 Thread Hyukjin Kwon
Thanks for checking it, Jacek.

On Sun, Feb 16, 2020 at 7:23 PM Jacek Laskowski wrote:



Newbie issues

2020-02-16 Thread PRAKASH GOPALSAMY
Hi Team,
I am using Spark in our projects, and now I would like to contribute to
Spark development. Could you please point me to some newbie issues I can
start with, so I can get familiar with the Apache Spark contribution
process?

Regards,
Prakash Gopalsamy.