Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Artem Schitow Fri, 28 Jul 2017 10:08:18 -0700

> String encoding is a concept similar to "collation" in RDBMS. You can
> define it either globally, or on per-table basis.


Or on a per-column (per-field) basis. Oracle does not have a per-column
charset, but some other databases provide this option.

MySQL:
- https://dev.mysql.com/doc/refman/5.7/en/create-table.html
  | CHAR[(length)] [BINARY]
      [CHARACTER SET charset_name] [COLLATE collation_name]
  | VARCHAR(length) [BINARY]
      [CHARACTER SET charset_name] [COLLATE collation_name]
  | TEXT [BINARY]
      [CHARACTER SET charset_name] [COLLATE collation_name]

SQL Server:
- https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
  <column_definition> ::=
  column_name <data_type>
      [ FILESTREAM ]
      [ COLLATE collation_name ]

Postgres:
- https://www.postgresql.org/docs/9.6/static/sql-createtable.html
  CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE
      [ IF NOT EXISTS ] table_name ( [
    { column_name data_type [ COLLATE collation ]
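
To make the per-field option concrete, here is a minimal Java sketch of how an
encoding lookup could fall back from field level to type level to a global
default. EncodingResolver and its methods are hypothetical names invented for
this illustration; none of this is existing Ignite API, and the actual design
is still under discussion in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not Ignite API): picks the charset for a string field. */
public class EncodingResolver {
    private final Charset globalDefault;
    private final Map<String, Charset> perType = new HashMap<>();
    private final Map<String, Charset> perField = new HashMap<>();

    public EncodingResolver(Charset globalDefault) {
        this.globalDefault = globalDefault;
    }

    /** Registers a type-level encoding. */
    public EncodingResolver forType(String typeName, Charset cs) {
        perType.put(typeName, cs);
        return this;
    }

    /** Registers a field-level encoding, which overrides the type-level one. */
    public EncodingResolver forField(String typeName, String fieldName, Charset cs) {
        perField.put(typeName + '.' + fieldName, cs);
        return this;
    }

    /** Lookup order: field, then type, then global default. */
    public Charset resolve(String typeName, String fieldName) {
        Charset cs = perField.get(typeName + '.' + fieldName);
        if (cs == null)
            cs = perType.get(typeName);
        return cs != null ? cs : globalDefault;
    }

    public static void main(String[] args) {
        EncodingResolver r = new EncodingResolver(StandardCharsets.UTF_8)
            .forType("Person", StandardCharsets.UTF_16LE)
            .forField("Person", "name", StandardCharsets.ISO_8859_1);

        System.out.println(r.resolve("Person", "name"));    // field-level override
        System.out.println(r.resolve("Person", "address")); // type-level setting
        System.out.println(r.resolve("Order", "comment"));  // global default
    }
}
```

The most specific setting wins, which mirrors how a column-level COLLATE
overrides a table or database default in the grammars above.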

> 1) I have a class Person with field "name". I have two caches/tables - one
> for US persons, where name is in Latin, another for RU persons with
> Cyrillic names. How can I achieve optimal encoding formats for both tables?

You would need two classes in this case, perhaps with a common parent.
Otherwise you have to pick a common denominator and settle on one encoding
for both of them, as Java did with UTF-16 for java.util.String.
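
A small Java sketch shows what is at stake in choosing one encoding for both
tables. The class name EncodingSizes and the sample names are made up for
illustration. With UTF-8 the Latin name takes 4 bytes and the Cyrillic one 8;
with UTF-16 both take 8; with a single-byte Cyrillic code page such as
windows-1251 both take 4. That last charset is not among the ones every JVM
must provide, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncodingSizes {
    /** Number of bytes the string occupies when encoded with the given charset. */
    public static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String latin = "Ivan";
        // The same name in Cyrillic, written with escapes so the result
        // does not depend on the source file encoding.
        String cyrillic = "\u0418\u0432\u0430\u043d";

        List<Charset> charsets = new ArrayList<>();
        charsets.add(StandardCharsets.UTF_8);
        charsets.add(StandardCharsets.UTF_16LE); // LE variant: no byte-order mark
        // windows-1251 is optional in a JVM, so only use it when present.
        if (Charset.isSupported("windows-1251"))
            charsets.add(Charset.forName("windows-1251"));

        for (Charset cs : charsets)
            System.out.printf("%-12s latin=%d cyrillic=%d%n",
                cs.name(), encodedLength(latin, cs), encodedLength(cyrillic, cs));
    }
}
```

No single choice is smallest for both data sets unless the Latin names fit in
the Cyrillic code page too, which is the trade-off behind "common denominator".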

—
Artem Schitow
[email protected]




> On 28 Jul 2017, at 14:45, Vladimir Ozerov <[email protected]> wrote:
> 
> String encoding is a concept similar to "collation" in RDBMS. You can
> define it either globally, or on per-table basis. The same should be done
> for Ignite. We do not define behavior of a type. We define behavior of a
> *storage*.
> 
> Two cases where the proposed per-type and per-type-field approach
> doesn't work:
> 1) I have a class Person with field "name". I have two caches/tables - one
> for US persons, where name is in Latin, another for RU persons with
> Cyrillic names. How can I achieve optimal encoding formats for both tables?
> 2) I have an empty grid. Now I want to create a cache/table with custom
> encoding. How can I do that without cluster restart? I cannot, because
> BinaryTypeConfiguration is configured statically, while caches/tables can be
> created at runtime.
> 
> On Fri, Jul 28, 2017 at 2:38 PM, Pavel Tupitsyn <[email protected]>
> wrote:
> 
>>> As Pavel mentioned, Marshaller should not be tied to cache
>>> should be added to per-cache level
>> Not sure if I follow.
>> Marshalling and caching are two separate mechanisms.
>> Defining binary format in CacheConfiguration violates separation of
>> concerns.
>> 
>>> Encoding *must not* be added to per-class or per-field level, this is
>>> wrong
>> What is wrong with this? BinaryTypeConfiguration looks like the right place
>> for such a setting.
>> Are we talking from an SQL standpoint here, so you want this to be defined
>> somehow via DDL in the future?
>> 
>> On Fri, Jul 28, 2017 at 2:30 PM, Vladimir Ozerov <[email protected]>
>> wrote:
>> 
>>> Encoding *must not* be added to per-class or per-field level, this is
>>> wrong.
>>> 
>>> It should be added to per-cache level, and to per-cache-column level in
>>> future.
>>> 
>>> Fri, 28 Jul 2017 at 14:27, Andrey Kuznetsov <[email protected]>:
>>> 
>>>> We discussed this with Pavel and Anton just a moment ago. Summary
>>>> follows.
>>>> 
>>>> - New byte "flag" is to be added (ENCODED_STRING)
>>>> - 'Encoding' property is to be added at
>>>>  -- global level (BinaryConfiguration)
>>>>  -- per-class level (BinaryTypeConfiguration)
>>>>  -- per-field level (BinaryTypeConfiguration)
>>>> 
>>>> 2017-07-28 14:15 GMT+03:00 Vladimir Ozerov [via Apache Ignite Developers]
>>>> <[email protected]>:
>>>> 
>>>>> As Pavel mentioned, Marshaller should not be tied to cache, BinaryObject
>>>>> should be self-explanatory, i.e. contain all information necessary for
>>>>> unmarshalling. This is an absolute requirement.
>>>>> 
>>>>> We will have one extra byte in serialized form, meaning that the
>>>>> advantage of custom encoding will become evident for all strings with
>>>>> length >= 1, which is perfectly fine. I do not quite understand what we
>>>>> are arguing about.
>>>>> 
>>>>> As for configuration, we can do it as follows:
>>>>> 
>>>>> 1) Add global encoding, UTF8 by default.
>>>>> 2) Add per-cache encoding.
>>>>> 3) Add encoding to JDBC and ODBC driver properties.
>>>>> 
>>>>> This should be enough.
>>>>> 
>>>>> 
>>>> --
>>>> Best regards,
>>>>  Andrey Kuznetsov.
>>>> 
>>> 
>>
