Re: [Maria-discuss] Collations, trailing spaces and unique indexes

Binarus Fri, 11 Mar 2016 09:25:35 -0800

On 11.03.2016 15:45, Alexander Barkov wrote:
> FYI, I have added a new task for this:
> 
> https://jira.mariadb.org/browse/MDEV-9711
>


Alexander, not sure if my question makes any sense, but what about the _cs 
collations? Like latin1_general_cs, latin2_czech_cs? Don't we like to have 
these, too?

Regards,

Binarus


> 
> On 03/11/2016 03:20 PM, Alexander Barkov wrote:
>> Hello Binarus, Kristian,
>>
>> On 03/11/2016 02:25 PM, Kristian Nielsen wrote:
>>> Binarus <li...@binarus.de> writes:
>>>
>>>> "All MySQL collations are of type PADSPACE. This means that all CHAR,
>>>> VARCHAR, and TEXT values in MySQL are compared without regard to any
>>>> trailing spaces. “Comparison” in this context does not include the
>>>
>>> Yes, I have always found this terminally stupid as well. But I think it
>>> comes from the SQL standard.
>>
>> Probably this comes from here:
>>
>>> 2) A <standard character set name> specifies the name of a character set 
>>> that is defined by a national or
>>> international standard. The character repertoire of CS is defined by the 
>>> standard defining the character set
>>> identified by that <standard character set name>. The default collation of 
>>> the character set is defined by
>>> the order of the characters in the standard and has the PAD SPACE 
>>> characteristic.
>>
>> So we follow the PAD SPACE requirement for default collations here.
>>
>> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way
>> by default. I agree with the standard here. We don't want different
>> behavior for CHAR and VARCHAR.
>>
>>
>> Btw, we don't follow the requirement that the default collation must be
>> defined according to "the order of the characters in the standard".
>> Default collations are traditionally case insensitive in MariaDB/MySQL.
>>
>>>
>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>>> think it works much the same in most respects. But obviously some semantics
>>> is lost when the server no longer is aware of the character set used.
>>
>> Correct.
>>
>>>
>>>> Since the index behaviour obviously depends on the collation, would
>>>> building an own collation which does not PADSPACE be an option? I have
>>>
>>> That would be interesting, actually. I don't know what support there is for
>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>>
>> We don't have NO PAD collations yet.
>> I don't remember that anybody ever asked for this before.
>>
>> From a glance, this should not be too hard to implement.
>>
>> You previously wrote you need utf8.
>> So there are three options to make a new collation,
>> based on one of these existing collations:
>>
>> - utf8_general_ci
>> - utf8_unicode_ci
>> - utf8_bin
>>
>> but with NO PAD characteristics.
>>
>> Suppose you need utf8_general_nopad_ci
>> (i.e. based on utf8_general_ci)
>>
>>
>> utf8_general_ci is implemented in strings/ctype-utf8.c
>>
>> A new collation handler should be added, similar to this one:
>>
>> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler =
>> {
>>     NULL,               /* init */
>>     my_strnncoll_utf8_general_ci,
>>     my_strnncollsp_utf8_general_ci,
>>     my_strnxfrm_unicode,
>>     my_strnxfrmlen_unicode,
>>     my_like_range_mb,
>>     my_wildcmp_utf8,
>>     my_strcasecmp_utf8,
>>     my_instr_mb,
>>     my_hash_sort_utf8,
>>     my_propagate_complex
>> };
>>
>>
>> but these three virtual functions must be redefined to new
>> similar functions that do not ignore trailing spaces:
>>
>> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes
>> - my_hash_sort_utf8              - this is used for HASH indexes
>> - my_strnxfrm_unicode            - this is used for filesort
>>                                    (non-indexed ORDER BY)
>>
>> All other functions can be reused from the existing PAD SPACE collation.
>>
>>
>>
>> Then a new "struct charset_info_st" should be defined,
>> similar to my_charset_utf8_general_ci.
>>
>> Sounds like a few hours of work.
>>
>>
>> But then thorough testing will be needed,
>> which will be the most time consuming part.
>>
>>
>> Also, for consistency, it's worthy to implement
>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>> and then eventually NO PAD collations for all other character sets.
>>
>>>
>>>  - Kristian.
>>>

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp

Re: [Maria-discuss] Collations, trailing spaces and unique indexes

Reply via email to