Re: [Maria-discuss] Collations, trailing spaces and unique indexes

Alexander Barkov Tue, 15 Mar 2016 22:24:07 -0700

  Hi Binarus,

Thanks for a detailed answer.
It seems that NO PAD collations are going to be a very welcome feature.


Please see my replies and questions below.


On 03/11/2016 07:56 PM, Binarus wrote:
> Hello Kristian, hello Alexander,
> 
> On 11.03.2016 12:20, Alexander Barkov wrote:
>> So we follow the PAD SPACE requirement for default collations here.
> 
> I think it's reasonable that the default collations are implemented according 
> to the standard.
> Nevertheless, the standard enforces an exceptionally stupid policy here 
> (IMHO),

Using PAD SPACE by default makes it safe to ALTER TABLE and switch:

- from VARCHAR(N) to CHAR(M)
- from CHAR(N)    to VARCHAR(M)

where M>=N.

I think that was the main reason for make PAD SPACE by default.
So I agree with the standard here.


> and therefore I think we are in urgent need of additional (non-default) 
> collations which solve that problem (and what you have written below makes me 
> hope ...).
> 
>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I
>>> think it works much the same in most respects. But obviously some semantics
>>> is lost when the server no longer is aware of the character set used.
>>
>> Correct.
>  
> I'll go that way, sticking with MySQL, if the problem won't be solved by new 
> collations in MariaDB.
> 
>>>
>>>> Since the index behaviour obviously depends on the collation, would
>>>> building an own collation which does not PADSPACE be an option? I have
>>>
>>> That would be interesting, actually. I don't know what support there is for
>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)?
>>
>> We don't have NO PAD collations yet.
> 
> Thanks for making this clear; I haven't found something about it in the 
> documents. By the way, the document sections which deal with the char data 
> types could eventually mention these pitfalls (like the MySQL documentation) 
> :-)

Sure, we'll mention this clearer when we add NO PAD collations.

> 
>> I don't remember that anybody ever asked for this before.
> 
> I wonder why, given the dozens (or hundreds?) of threads and tutorials around 
> the web which try to explain how to circumvent these issues (none of them 
> provides a general solution besides switching to ...BINARY).
>  
>> [...]
>> Sounds like a few hours of work.
> 
> This is very good news. I will now try to give some arguments why this really 
> should be done.
> 
> 1) Avoid programming mistakes
> 
> People might be programming for decades, but when not being database experts 
> and thrown into a web development project, even those experienced programmers 
> won't come to the idea that a database which returns trailing spaces in the 
> result set of SELECT statements throws away these trailing spaces when 
> building the index or when comparing (e.g. for evaluation the conditions of 
> the WHERE CLAUSE). This is completely illogical and dangerous and makes 
> debugging substantially more difficult.
> 
> 2) Developers (and not DBMSs) should be in control of what happens to 
> trailing spaces

Exactly.
The DBMS should supply a full variety of useful collations,
so the users have enough flexibility to choose the one that suites best.
So I like the idea of new NO PAD collations very much.

> 
> A DBMS must not silently change data, nor for comparing nor for other 
> purposes. It is solely the decision of the application logic if trailing 
> spaces are acceptable or desired in data values (possibly entered by users), 
> but not the decision of the DBMS. Nothing else than the application logic has 
> to decide what to do with trailing spaces (trim them, keep them, reject user 
> input, return error messages etc.), and the DBMS has to store, retrieve, 
> compare and index the results of the application logic as-is. Period.
> 
> However, there is no argument against additional and optional functions or 
> configuration flags of the DBMS which allow easy trimming or other (automatic 
> and silent) transformation of the values and all sorts of weird compare 
> modes, if these are optional. But in the first place, there must also be a 
> mode / column type for every sort of data, notably for strings, in which data 
> is stored, compared and indexed as-is.
> 
> We already have the binary data types which in this respect behave like 
> desired, but all string functions fail with binary data values or behave 
> surprisingly / impractically, so nearly all string functions (which are very 
> important in many applications) are lost for binary data type columns.
> 
> 3) There are use cases for string data with trailing spaces
> 
> I have seen some examples of trailing spaces, some of them useless, but some 
> quite meaningful. In every case, the application logic relied on a string 
> with trailing space being not equal to the same string without the trailing 
> space. In some cases, the developers had not thought about the issues the 
> trailing spaces could cause, i.e. their programs were faulty; in the other 
> cases, they have put a lot of effort and used the weirdest tricks and index 
> combinations to circumvent the problems. I can't remember a case where they 
> have switched to ...BINARY data types.
> 
> Two of the funniest cases (just because it's Friday - skip the next two 
> sections if you don't want to laugh):
> 
> Some application actually stored an internal status bit at the end of 
> existing user input, probably because the programmers were in a hurry or had 
> no administrative access to the MySQL server, i.e. they couldn't create a new 
> status column in the respective table. Now how could you save a status bit 
> within an existing string without making it visible to the users? The answer 
> is of course: Append a space at the end of the string to set the status, 
> remove it to clear the status. Of course, the respective column had to be 
> UNIQUE in the sense that two identical user inputs should be allowed if the 
> "status bit" differed ...
> 
> Some other application generated session keys in string form and used a part 
> of the ASCII character set for the keys (e.g. a-z, A-Z, 0-9, special chars 
> like $% and so on, including the space). Of course, as chance would have it, 
> there could be a trailing space at the end of the session key. I can't 
> remember if this led to problems in this case, but theoretically, it surely 
> could have.
> 
> Unfortunately, my own use case is too difficult to explain here.

Thanks for use examples. I liked the one about "status bit" very much :)

> 
> To summarize, there are all sorts of serious pitfalls which might be very 
> difficult to debug and very difficult to foresee if the DBMS ignores trailing 
> spaces when storing, retrieving, comparing or indexing strings, so I would be 
> very grateful, and I would immediately switch to MariaDB, if appropriate 
> collations would be added there.
>  
>> But then thorough testing will be needed,
>> which will be the most time consuming part.
> 
> Probably, but I have a feeling that *not* including trailing spaces in 
> comparisons / collations is much more difficult than including them.
> A trailing space is just a char as every other char, isn't it? So I suppose 
> the structure of the respective functions for _nopad_ collations will be much 
> less complicated than the structure of the existing functions.

Yes. I think so. The idea is just to disable the code that ignores
trailing spaces.


> Given that, the testing eventually wouldn't consume too much time ...

We'll need to cover thoroughly many thing, including but not limited to:

- DISTINCT
- UNION
- ORDER BY with indexes
- ORDER BY without indexes
- GROUP BY
- Mixing NO PAD and PAD collations
- Function output
- GROUP_CONCAT(DISTINCT)


>  
>> Also, for consistency, it's worthy to implement
>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once,
>> and then eventually NO PAD collations for all other character sets.
> 
> Personally, I would be totally happy if I had the three new utf8 collations 
> (I probably will never use anything else than utf8 for strings).

We're interested in implementing NO PAD counterparts for the default
and the _bin collations under terms of Google Summer Of Code 2016
(GSoC). By this will take a few months. I guess you need this more
urgently.

Question: would utf8_unicode_nopad_ci work for you for now?
We could implement it separately, without having to wait until
the end of GSoC.

> 
> Thank you very much!
> 
> Binarus
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-discuss
> Post to     : maria-discuss@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~maria-discuss
> More help   : https://help.launchpad.net/ListHelp
> 

_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp

Re: [Maria-discuss] Collations, trailing spaces and unique indexes

Reply via email to