Hi Binarus, Thanks for a detailed answer. It seems that NO PAD collations are going to be a very welcome feature.
Please see my replies and questions below. On 03/11/2016 07:56 PM, Binarus wrote: > Hello Kristian, hello Alexander, > > On 11.03.2016 12:20, Alexander Barkov wrote: >> So we follow the PAD SPACE requirement for default collations here. > > I think it's reasonable that the default collations are implemented according > to the standard. > Nevertheless, the standard enforces an exceptionally stupid policy here > (IMHO), Using PAD SPACE by default makes it safe to ALTER TABLE and switch: - from VARCHAR(N) to CHAR(M) - from CHAR(N) to VARCHAR(M) where M>=N. I think that was the main reason for make PAD SPACE by default. So I agree with the standard here. > and therefore I think we are in urgent need of additional (non-default) > collations which solve that problem (and what you have written below makes me > hope ...). > >>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I >>> think it works much the same in most respects. But obviously some semantics >>> is lost when the server no longer is aware of the character set used. >> >> Correct. > > I'll go that way, sticking with MySQL, if the problem won't be solved by new > collations in MariaDB. > >>> >>>> Since the index behaviour obviously depends on the collation, would >>>> building an own collation which does not PADSPACE be an option? I have >>> >>> That would be interesting, actually. I don't know what support there is for >>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)? >> >> We don't have NO PAD collations yet. > > Thanks for making this clear; I haven't found something about it in the > documents. By the way, the document sections which deal with the char data > types could eventually mention these pitfalls (like the MySQL documentation) > :-) Sure, we'll mention this clearer when we add NO PAD collations. > >> I don't remember that anybody ever asked for this before. > > I wonder why, given the dozens (or hundreds?) of threads and tutorials around > the web which try to explain how to circumvent these issues (none of them > provides a general solution besides switching to ...BINARY). > >> [...] >> Sounds like a few hours of work. > > This is very good news. I will now try to give some arguments why this really > should be done. > > 1) Avoid programming mistakes > > People might be programming for decades, but when not being database experts > and thrown into a web development project, even those experienced programmers > won't come to the idea that a database which returns trailing spaces in the > result set of SELECT statements throws away these trailing spaces when > building the index or when comparing (e.g. for evaluation the conditions of > the WHERE CLAUSE). This is completely illogical and dangerous and makes > debugging substantially more difficult. > > 2) Developers (and not DBMSs) should be in control of what happens to > trailing spaces Exactly. The DBMS should supply a full variety of useful collations, so the users have enough flexibility to choose the one that suites best. So I like the idea of new NO PAD collations very much. > > A DBMS must not silently change data, nor for comparing nor for other > purposes. It is solely the decision of the application logic if trailing > spaces are acceptable or desired in data values (possibly entered by users), > but not the decision of the DBMS. Nothing else than the application logic has > to decide what to do with trailing spaces (trim them, keep them, reject user > input, return error messages etc.), and the DBMS has to store, retrieve, > compare and index the results of the application logic as-is. Period. > > However, there is no argument against additional and optional functions or > configuration flags of the DBMS which allow easy trimming or other (automatic > and silent) transformation of the values and all sorts of weird compare > modes, if these are optional. But in the first place, there must also be a > mode / column type for every sort of data, notably for strings, in which data > is stored, compared and indexed as-is. > > We already have the binary data types which in this respect behave like > desired, but all string functions fail with binary data values or behave > surprisingly / impractically, so nearly all string functions (which are very > important in many applications) are lost for binary data type columns. > > 3) There are use cases for string data with trailing spaces > > I have seen some examples of trailing spaces, some of them useless, but some > quite meaningful. In every case, the application logic relied on a string > with trailing space being not equal to the same string without the trailing > space. In some cases, the developers had not thought about the issues the > trailing spaces could cause, i.e. their programs were faulty; in the other > cases, they have put a lot of effort and used the weirdest tricks and index > combinations to circumvent the problems. I can't remember a case where they > have switched to ...BINARY data types. > > Two of the funniest cases (just because it's Friday - skip the next two > sections if you don't want to laugh): > > Some application actually stored an internal status bit at the end of > existing user input, probably because the programmers were in a hurry or had > no administrative access to the MySQL server, i.e. they couldn't create a new > status column in the respective table. Now how could you save a status bit > within an existing string without making it visible to the users? The answer > is of course: Append a space at the end of the string to set the status, > remove it to clear the status. Of course, the respective column had to be > UNIQUE in the sense that two identical user inputs should be allowed if the > "status bit" differed ... > > Some other application generated session keys in string form and used a part > of the ASCII character set for the keys (e.g. a-z, A-Z, 0-9, special chars > like $% and so on, including the space). Of course, as chance would have it, > there could be a trailing space at the end of the session key. I can't > remember if this led to problems in this case, but theoretically, it surely > could have. > > Unfortunately, my own use case is too difficult to explain here. Thanks for use examples. I liked the one about "status bit" very much :) > > To summarize, there are all sorts of serious pitfalls which might be very > difficult to debug and very difficult to foresee if the DBMS ignores trailing > spaces when storing, retrieving, comparing or indexing strings, so I would be > very grateful, and I would immediately switch to MariaDB, if appropriate > collations would be added there. > >> But then thorough testing will be needed, >> which will be the most time consuming part. > > Probably, but I have a feeling that *not* including trailing spaces in > comparisons / collations is much more difficult than including them. > A trailing space is just a char as every other char, isn't it? So I suppose > the structure of the respective functions for _nopad_ collations will be much > less complicated than the structure of the existing functions. Yes. I think so. The idea is just to disable the code that ignores trailing spaces. > Given that, the testing eventually wouldn't consume too much time ... We'll need to cover thoroughly many thing, including but not limited to: - DISTINCT - UNION - ORDER BY with indexes - ORDER BY without indexes - GROUP BY - Mixing NO PAD and PAD collations - Function output - GROUP_CONCAT(DISTINCT) > >> Also, for consistency, it's worthy to implement >> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, >> and then eventually NO PAD collations for all other character sets. > > Personally, I would be totally happy if I had the three new utf8 collations > (I probably will never use anything else than utf8 for strings). We're interested in implementing NO PAD counterparts for the default and the _bin collations under terms of Google Summer Of Code 2016 (GSoC). By this will take a few months. I guess you need this more urgently. Question: would utf8_unicode_nopad_ci work for you for now? We could implement it separately, without having to wait until the end of GSoC. > > Thank you very much! > > Binarus > > _______________________________________________ > Mailing list: https://launchpad.net/~maria-discuss > Post to : maria-discuss@lists.launchpad.net > Unsubscribe : https://launchpad.net/~maria-discuss > More help : https://help.launchpad.net/ListHelp > _______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp