[ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831392#comment-17831392
 ] 

Gideon P edited comment on SPARK-47413 at 3/27/24 2:53 PM:
-----------------------------------------------------------

bq.  First confirm what is the expected behaviour for these functions when 
given collated strings,

h1. Collation Handling Summary

For the substr, substring, left, and right string manipulation functions in 
SQL, the explicit or implicit collation of the first parameter is preserved in 
the function's output. This means that if the input string has a specific 
collation (whether defined explicitly through a COLLATE expression or 
implicitly by its source, such as a column's collation), this collation is 
maintained in the resulting string produced by these functions. 

The behavior of these functions, apart from collation handling, remains 
consistent with their standard operation. This includes their handling of 
starting positions, lengths, and their ability to work with both positive and 
negative indices for defining substring boundaries.
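
The expectation above can be sketched with a small toy model. This is not Spark's implementation: `CollatedString` and the three helper functions are hypothetical stand-ins, the collation names are plain labels, and the position rules (1-based positions, 0 treated as 1, negative positions counting back from the end) are assumed to match the standard behaviour of these functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollatedString:
    """Hypothetical stand-in for a STRING value tagged with a collation name."""
    value: str
    collation: str = "UTF8_BINARY"


def substring(s: CollatedString, pos: int, length: Optional[int] = None) -> CollatedString:
    n = len(s.value)
    # Assumed rules: 1-based positions, 0 behaves like 1,
    # and a negative position counts back from the end of the string.
    start = pos - 1 if pos > 0 else (n + pos if pos < 0 else 0)
    end = n if length is None else start + length
    start, end = max(start, 0), min(end, n)
    piece = s.value[start:end] if end > start else ""
    # The collation of the first parameter is carried over unchanged.
    return CollatedString(piece, s.collation)


def left(s: CollatedString, length: int) -> CollatedString:
    return substring(s, 1, length)


def right(s: CollatedString, length: int) -> CollatedString:
    if length <= 0:
        return CollatedString("", s.collation)
    return substring(s, -length)


s = CollatedString("Spark SQL", "UNICODE_CI")
print(substring(s, 5))   # characters from position 5 onward, still UNICODE_CI
print(left(s, 5))
print(right(s, 3))
```

In this sketch only the characters change; the collation label on the result is always the one carried by the first argument.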

Although the `len` parameter (a number) can be passed as a string, its 
implicit or explicit collation is discarded and does not affect the output.
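
A minimal sketch of that rule, again as a toy model rather than Spark code: `CollatedString` and `left` are hypothetical names, and the cast from a string-typed length to an integer is an assumption about how the implicit conversion behaves.

```python
from typing import NamedTuple, Union


class CollatedString(NamedTuple):
    """Hypothetical stand-in for a STRING value tagged with a collation name."""
    value: str
    collation: str


def left(s: CollatedString, length: Union[int, CollatedString]) -> CollatedString:
    # Assumption: a string-typed length is cast to an integer,
    # and whatever collation it carried is dropped at that point.
    if isinstance(length, CollatedString):
        length = int(length.value)
    # Only the collation of the first parameter reaches the result.
    return CollatedString(s.value[:max(length, 0)], s.collation)


text = CollatedString("Spark", "UNICODE_CI")
length_as_string = CollatedString("3", "UTF8_BINARY")
print(left(text, length_as_string))
```

Here the result keeps the collation of `text`, regardless of the collation attached to the length argument.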

Note: unit tests should show that collation is supported for:
-       STRING columns
-       STRING expressions
-       STRING fields in structs

h1. Session Collation
Session collation is the third level of collation. I will get back to you 
about the expectations for these four functions. 



> Substring, Right, Left (all collations)
> ---------------------------------------
>
>                 Key: SPARK-47413
>                 URL: https://issues.apache.org/jira/browse/SPARK-47413
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
>              Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
