[ 
https://issues.apache.org/jira/browse/FLINK-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599559#comment-16599559
 ] 

Hequn Cheng commented on FLINK-10257:
-------------------------------------

Hi, [~pnowojski] thanks for looking into this. I found some descriptions about 
comparison of character strings in sql standard:
 _When values of unequal length are compared, if the collation for the 
comparison has the NO PAD characteristic and the shorter value is equal to some 
prefix of the longer value, then the shorter value is considered less than the 
longer value. If the collation for the comparison has the PAD SPACE 
characteristic, for the purposes of the comparison, the shorter value is 
effectively extended to the length of the longer by concatenation of <space>s 
on the right._

So, I think there are may be two different modes for comparison: NO PAD and PAD 
SPACE.

*For NO PAD collation mode.*
Compare strings as we do now, but Calcite should provide a NO PAD collation 
mode so that {{SELECT 'POLAND' = 'POLAND    '}} returns false. In this mode, we 
should not pad blanks to strings, say, case when. But we may not stop users 
from doing it. Users can pad blanks by cast or simply write {{'POLAND    '}}. 
Users need to bear the corresponding consequences, since all trailing blanks 
are added by themselves. 

*For PAD SPACE collation mode.*
Ignore trailing blanks when compare strings. In this mode, case when/cast will 
pad blanks.

What do your guys think?


> Incorrect CHAR type support in Flink SQL and Table API
> ------------------------------------------------------
>
>                 Key: FLINK-10257
>                 URL: https://issues.apache.org/jira/browse/FLINK-10257
>             Project: Flink
>          Issue Type: Bug
>          Components: Table API &amp; SQL
>            Reporter: Piotr Nowojski
>            Priority: Critical
>
> Despite that we officially do not support CHAR type, this type is visible and 
> accessible for the users. First of all, string literals have default type of 
> CHAR in SQL. Secondly users can always cast expressions/columns to CHAR.
> Problem is that we do not support CHAR correctly. We mishandle it in:
>  # comparisons and functions
>  # writing values to sinks
> According to SQL standard (and as almost all of the other databases do), CHAR 
> comparisons should ignore white spaces. On the other hand, functions like 
> {{CONCAT}} or {{LENGTH}} shouldn't: 
> [http://troels.arvin.dk/db/rdbms/#data_types-char] .
> Currently in In Flink we completely ignore those rules. Sometimes we store 
> internally CHAR with padded spaces sometimes without. This results with semi 
> random behaviour with respect to comparisons/functions/writing to sinks. For 
> example following query:
> {code:java}
> tEnv.sqlQuery("SELECT CAST(s AS CHAR(10)) FROM 
> sourceTable").insertInto("targetTable")
> env.execute()
> {code}
> Where `sourceTable` has single {{VARCHAR(10)}} column with values: "Hi", 
> "Hello", "Hello world", writes to sink not padded strings (correctly), but 
> following query:
> {code:java}
> tEnv.sqlQuery("SELECT * FROM (SELECT CAST(s AS CHAR(10)) c FROM sourceTable) 
> WHERE c = 'Hi'")
>   .insertInto("targetTable")
> env.execute(){code}
> Incorrectly filters out all of the results, because {{CAST(s AS CHAR(10))}} 
> is a NOOP in Flink, while 'Hi' constant handed by Calcite to us will be 
> padded with 8 spaces.
> On the other hand following query produces strings padded with spaces:
> {code:java}
> tEnv.sqlQuery("SELECT CASE l WHEN 1 THEN 'GERMANY' WHEN 2 THEN 'POLAND' ELSE 
> 'this should not happen' END FROM sourceTable")
>   .insertInto("targetTable")
> env.execute()
> val expected = Seq(
>   "GERMANY",
>   "POLAND",
>   "POLAND").mkString("\n")
> org.junit.ComparisonFailure: Different elements in arrays: expected 3 
> elements and received 3
> expected: [GERMANY, POLAND, POLAND]
> received: [GERMANY , POLAND , POLAND ]
> {code}
> To make matter even worse, Calcite's constant folding correctly performs 
> comparisons, while if same comparisons are performed by Flink, they yield 
> different results. In other words in SQL:
> {code:java}
> SELECT 'POLAND' = 'POLAND    '
> {code}
> return true, but same expression performed on columns
> {code:java}
> SELECT CAST(country as CHAR(10)) = CAST(country_padded as CHAR(10)) FROM 
> countries{code}
> returns false.
> To further complicated things, in SQL our string literals have {{CHAR}} type, 
> while in Table API our literals have String type (effectively {{VARCHAR}}) 
> making results inconsistent between those two APIs.
>  
> CC [~twalthr] [~fhueske] [~hequn8128]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to