[ 
https://issues.apache.org/jira/browse/CALCITE-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Tao updated CALCITE-5840:
-----------------------------
    Description: 
There are 2 problems.

1. 4-char fixed length & mysql non-fixed length
current in calcite:
SELECT SOUNDEX('Quadratically');
    -> 'Q363'

mysql
mysql> SELECT SOUNDEX('Quadratically');
    -> 'Q36324'

As docs illustrated below, A standard soundex string is four characters long, 
but the MySQL SOUNDEX() function returns an arbitrarily long string.

[https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex]
[https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_soundex]

2. when current calcite mysql soundex deal with multi bytes string such as 
UTF8, it will cause IllegalArgumentException because we use Apache Commons 
Codec to get the result (it implement like this).
however, MySQL is not this behavior. e.g.
{code:java}
mysql> select soundex('字节');
+-------------------+
| soundex('字节')   |
+-------------------+
| 字000             |
+-------------------+
1 row in set (0.00 sec)

mysql> select soundex('字节字节');
+-------------------------+
| soundex('字节字节')     |
+-------------------------+
| 字000                   |
+-------------------------+
1 row in set (0.10 sec)

mysql> select soundex('字节字节字节');
+-------------------------------+
| soundex('字节字节字节')       |
+-------------------------------+
| 字000                         |
+-------------------------------+
1 row in set (0.03 sec)

mysql> select soundex('バイト');
+----------------------+
| soundex('バイト')    |
+----------------------+
| バ000                |
+----------------------+
1 row in set (0.04 sec)

mysql> select soundex('байт');
+---------------------+
| soundex('байт')     |
+---------------------+
| б000                |
+---------------------+
1 row in set (0.00 sec) {code}
the different string parameter above means 'BYTE'.

I think we need to correct it.

  was:
There are 2 problems.

1. 4-char fixed length & mysql non-fixed length
current in calcite:
SELECT SOUNDEX('Quadratically');
    -> 'Q363'

mysql
mysql> SELECT SOUNDEX('Quadratically');
    -> 'Q36324'

As docs illustrated below, A standard soundex string is four characters long, 
but the MySQL SOUNDEX() function returns an arbitrarily long string.

https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex
https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_soundex


2. when soundex deal with multi bytes string such as UTF8, it will cause 
IllegalArgumentException because we use Apache Commons Codec to get the result 
(it implement like this).
however, MySQL is not this behavior. e.g.
{code:java}
mysql> select soundex('字节');
+-------------------+
| soundex('字节')   |
+-------------------+
| 字000             |
+-------------------+
1 row in set (0.00 sec)

mysql> select soundex('字节字节');
+-------------------------+
| soundex('字节字节')     |
+-------------------------+
| 字000                   |
+-------------------------+
1 row in set (0.10 sec)

mysql> select soundex('字节字节字节');
+-------------------------------+
| soundex('字节字节字节')       |
+-------------------------------+
| 字000                         |
+-------------------------------+
1 row in set (0.03 sec)

mysql> select soundex('バイト');
+----------------------+
| soundex('バイト')    |
+----------------------+
| バ000                |
+----------------------+
1 row in set (0.04 sec)

mysql> select soundex('байт');
+---------------------+
| soundex('байт')     |
+---------------------+
| б000                |
+---------------------+
1 row in set (0.00 sec) {code}
the different string parameter above means 'BYTE'.

I think we need to correct it.


> Incorrect SOUNDEX function semantics in MySql library
> -----------------------------------------------------
>
>                 Key: CALCITE-5840
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5840
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.34.0
>            Reporter: Ran Tao
>            Assignee: Ran Tao
>            Priority: Major
>
> There are 2 problems.
> 1. 4-char fixed length & mysql non-fixed length
> current in calcite:
> SELECT SOUNDEX('Quadratically');
>     -> 'Q363'
> mysql
> mysql> SELECT SOUNDEX('Quadratically');
>     -> 'Q36324'
> As docs illustrated below, A standard soundex string is four characters long, 
> but the MySQL SOUNDEX() function returns an arbitrarily long string.
> [https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex]
> [https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_soundex]
> 2. when current calcite mysql soundex deal with multi bytes string such as 
> UTF8, it will cause IllegalArgumentException because we use Apache Commons 
> Codec to get the result (it implement like this).
> however, MySQL is not this behavior. e.g.
> {code:java}
> mysql> select soundex('字节');
> +-------------------+
> | soundex('字节')   |
> +-------------------+
> | 字000             |
> +-------------------+
> 1 row in set (0.00 sec)
> mysql> select soundex('字节字节');
> +-------------------------+
> | soundex('字节字节')     |
> +-------------------------+
> | 字000                   |
> +-------------------------+
> 1 row in set (0.10 sec)
> mysql> select soundex('字节字节字节');
> +-------------------------------+
> | soundex('字节字节字节')       |
> +-------------------------------+
> | 字000                         |
> +-------------------------------+
> 1 row in set (0.03 sec)
> mysql> select soundex('バイト');
> +----------------------+
> | soundex('バイト')    |
> +----------------------+
> | バ000                |
> +----------------------+
> 1 row in set (0.04 sec)
> mysql> select soundex('байт');
> +---------------------+
> | soundex('байт')     |
> +---------------------+
> | б000                |
> +---------------------+
> 1 row in set (0.00 sec) {code}
> the different string parameter above means 'BYTE'.
> I think we need to correct it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to