[ 
https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138271#comment-16138271
 ] 

Vladimir Ozerov commented on IGNITE-5655:
-----------------------------------------

[~andrey-kuznetsov], my comments:
1) {{BinaryConfiguration.encoding}} - passing it as a {{byte}} is bad idea as 
it limits total number of available encodings. We should pass it as {{String}}, 
in the same way as it is done in Java classes. In case encoding is not 
supported, an exception should be thrown.
2) {{OdbcColumnMeta, OdbcMessageParser}} - unnecessary changes; ODBC doesn't 
distinguish between String types, this is our internal implementation detail.
3) {{SqlListenerNioListener}} - another unnecessary change. We encode strings 
to save space in storage and page memory. Encoding strings when sending error 
message doesn't make sense.
4) {{SqlListenerUtils}} - unnecessary changes, as encoded strings are not 
supported in JDBC/ODBC at the moment (I do not see any changes in drivers). 
Hence, we should always talk to drivers using {{UTF-8}} strings.
5) {{BinaryStringEncoding}} - should be a part of public API, so let's move it 
to {{org.apache.ignite.binary}} package.
6) {{BinaryReaderExImpl.fieldFlagNames}} - looks like an overkill to me. Just 
concatenate two well-known flags by hand.
7) {{BinarySerializedFieldComparator}} - it is illegal to ignore encoding byte. 
You may endup with situation where two different strings are encoded with 
different encoding, but have similar binary representation. Your code will 
report {{true}}, while {{false}} should be returned. Correct flow here:
7.1) If objects have different encodings - deserialize them and compare using 
String.compareTo
7.2) If encodings are same, we need to understand whether encoded 
representation is comparable. For example, suppose that in certain encoding 
{{A}} represented as {{2}}, and {{B}} is represented as {{1}}. Then 
{{A.compareTo(B) == -1}}, but {{compareArrays(serialize(A), serialize(B) == 
1}}, which is wrong. E.g. see \[1\]. We need to think on how to expose this 
properly, but I think this is a different task. For now let's just restrict any 
binary comparison of non-UTF8 strings and always deserialize them.
8) Feature is definitely undertested. At the very least we need the following 
tests:
8.1) Run SQL queries with custom encoding
8.2) Start two nodes with different encoding and see how they interact with 
each other.

Not covered cases which I propose to implement separately: ODBC, JDBC, C++, 
.NET, efficient ocmparison for inlined indexes and 
BinarySerializedFieldComparator.

The rest looks good to me. Thanks for contribution!

\[1\] https://dev.mysql.com/doc/refman/5.7/en/charset-general.html

> Introduce pluggable string encoder/decoder
> ------------------------------------------
>
>                 Key: IGNITE-5655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5655
>             Project: Ignite
>          Issue Type: New Feature
>          Components: binary
>    Affects Versions: 2.0
>            Reporter: Valentin Kulichenko
>            Assignee: Andrey Kuznetsov
>             Fix For: 2.2
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it 
> makes sense to serialize strings with different encodings to save space. 
> Let's add global property to control String encoding and customize our binary 
> protocol to support it. For instance, we can add another flag 
> {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in 
> BinaryConfiguration. This setting is optional, default encoding is UTF-8. 
> Currently, the same BinaryConfiguration is used for all cluster nodes, thus 
> no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to