Ran Tao created CALCITE-5840:
--------------------------------
Summary: Incorrect SOUNDEX function semantics in MySql library
Key: CALCITE-5840
URL: https://issues.apache.org/jira/browse/CALCITE-5840
Project: Calcite
Issue Type: Bug
Components: core
Affects Versions: 1.34.0
Reporter: Ran Tao
Assignee: Ran Tao
There are 2 problems.
1. 4-char fixed length & mysql non-fixed length
current in calcite:
SELECT SOUNDEX('Quadratically');
-> 'Q363'
mysql
mysql> SELECT SOUNDEX('Quadratically');
-> 'Q36324'
As docs illustrated below, A standard soundex string is four characters long,
but the MySQL SOUNDEX() function returns an arbitrarily long string.
https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex
https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_soundex
2. when soundex deal with multi bytes string such as UTF8, it will cause
IllegalArgumentException because we use Apache Commons Codec to get the result
(it implement like this).
however, MySQL is not this behavior. e.g.
{code:java}
mysql> select soundex('字节');
+-------------------+
| soundex('字节') |
+-------------------+
| 字000 |
+-------------------+
1 row in set (0.00 sec)
mysql> select soundex('字节字节');
+-------------------------+
| soundex('字节字节') |
+-------------------------+
| 字000 |
+-------------------------+
1 row in set (0.10 sec)
mysql> select soundex('字节字节字节');
+-------------------------------+
| soundex('字节字节字节') |
+-------------------------------+
| 字000 |
+-------------------------------+
1 row in set (0.03 sec)
mysql> select soundex('バイト');
+----------------------+
| soundex('バイト') |
+----------------------+
| バ000 |
+----------------------+
1 row in set (0.04 sec)
mysql> select soundex('байт');
+---------------------+
| soundex('байт') |
+---------------------+
| б000 |
+---------------------+
1 row in set (0.00 sec) {code}
the different string parameter above means 'BYTE'.
I think we need to correct it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)