On Tue, Dec 20, 2016 at 10:47 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> And Heikki has mentioned to me that he'd prefer not having an extra dependency for the normalization, which is LGPL-licensed by the way. So I have looked at the SASLprep business to see what should be done to get a complete implementation in core, completely independent of any external library.
>
> The first thing is to be able to understand in the SCRAM code if a string is UTF-8 or not, and this code is in src/common/. pg_wchar.c offers a set of routines exactly for this purpose, but it is built with libpq and is not available to src/common/. So instead of moving the whole file, I'd like to create a new file, src/common/utf8.c, which includes pg_utf_mblen() and pg_utf8_islegal(). On top of that, I think that having a routine able to check a full string would be useful for many users, as pg_utf8_islegal() can only check one multibyte character at a time. If the password string is found to be in UTF-8 format, SASLprep is applied. If not, the string is copied as-is, with perhaps unexpected effects for the client, but the client is in trouble already if it is not using UTF-8.
>
> Then comes the real business... Note that this is my first time touching encoding, particularly UTF-8, in depth, so please be nice. I may write things that are incorrect or that sound so from here :)
>
> The second thing is the normalization itself. Per RFC 4013, NFKC needs to be applied to the string. The operation is described completely in [1], and it consists of 1) a compatibility decomposition of the characters of the string, followed by 2) a canonical composition.
>
> About 1). The compatibility decomposition is defined in [2]: "by recursively applying the canonical and compatibility mappings, then applying the canonical reordering algorithm". The canonical and compatibility mappings are data available in UnicodeData.txt, the 6th column of the set defined in [3] to be precise. The meaning of the decomposition mappings is defined in [2] as well. The canonical decomposition basically consists of looking up a given character and replacing it with the sequence of characters given by its mapping. The compatibility mapping should be applied as well, but [5], a perl tool called charlint.pl which does this normalization work, does not care about this phase... Do we?
>
> About 2)... Once the decomposition has been applied, those characters need to be recomposed using the Canonical_Combining_Class field of UnicodeData.txt in [3], which is the 3rd column of the set. Its values are defined in [4]. Another interesting thing: charlint.pl [5] does not care about this phase either. I am wondering if we should not just drop this part as well...
>
> Once 1) and 2) are done, NFKC is complete, and so is SASLprep.
>
> So what we need on the Postgres side is a mapping table with the following fields:
> 1) The hexadecimal sequence of the UTF-8 character.
> 2) Its canonical combining class.
> 3) The kind of decomposition mapping, if defined.
> 4) The decomposition mapping, in hexadecimal format.
> Based on what I looked at, either perl or python could be used to process UnicodeData.txt and to generate a header file that would be included in the tree. There are 30k entries in UnicodeData.txt, 5k of which have a mapping, so that will result in rather large tables.
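Coming back to the full-string check mentioned above, here is a rough sketch of what such a routine could look like on top of pg_utf_mblen() and pg_utf8_islegal() (prototypes as in pg_wchar.h); the wrapper name pg_utf8_string_islegal is made up for illustration:

#include <stdbool.h>

/* prototypes as in src/include/mb/pg_wchar.h */
extern int	pg_utf_mblen(const unsigned char *s);
extern bool pg_utf8_islegal(const unsigned char *source, int length);

/*
 * Check whether a whole string of "len" bytes is valid UTF-8, walking it
 * one multibyte character at a time.  (Name made up for illustration.)
 */
static bool
pg_utf8_string_islegal(const char *str, int len)
{
	const unsigned char *s = (const unsigned char *) str;
	const unsigned char *end = s + len;

	while (s < end)
	{
		int			mblen = pg_utf_mblen(s);

		/* reject a sequence that would run past the end of the string */
		if (s + mblen > end)
			return false;
		if (!pg_utf8_islegal(s, mblen))
			return false;
		s += mblen;
	}
	return true;
}
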
> One thing to improve performance would be to store the length of the table in a static variable, order the entries by their hexadecimal keys, and do a binary search to find an entry. We could as well use fancier things like a set of tables forming a radix tree keyed on the decomposed bytes. We should end up doing just one lookup in the table per character anyway.
>
> In conclusion, at this point I am looking for feedback regarding the following items:
> 1) Where to put the UTF-8 check routines and what to move.
> 2) How to generate the mapping table using UnicodeData.txt. I'd think that using perl would be better.
> 3) The shape of the mapping table, which depends on how many operations we want to support in the normalization of the strings.
> The decisions on those items will drive the implementation in one direction or another.
>
> [1]: http://www.unicode.org/reports/tr15/#Description_Norm
> [2]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Character_Decomposition_Mappings
> [3]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
> [4]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Canonical_Combining_Class_Values
> [5]: https://www.w3.org/International/charlint/
>
> Heikki, others, thoughts?
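To give a slightly more concrete idea of the shape the mapping table and its lookup could take, here is a minimal sketch, assuming a header generated from UnicodeData.txt; the structure, the table names and the two sample entries are only illustrative:

#include <stddef.h>
#include <stdint.h>

/* One entry per code point that has a combining class or a decomposition. */
typedef struct
{
	uint32_t	codepoint;		/* the code point itself */
	uint8_t		comb_class;		/* canonical combining class (3rd column) */
	uint8_t		decomp_size;	/* number of code points in the mapping */
	const uint32_t *decomp;		/* decomposition mapping (6th column), or NULL */
} pg_unicode_decomposition;

/*
 * Tiny hand-written excerpt for illustration only; the real table would be
 * generated from UnicodeData.txt and kept sorted by codepoint.
 */
static const uint32_t decomp_00C0[] = {0x0041, 0x0300};

static const pg_unicode_decomposition UnicodeDecompMain[] = {
	{0x00C0, 0, 2, decomp_00C0},	/* LATIN CAPITAL LETTER A WITH GRAVE */
	{0x0300, 230, 0, NULL}			/* COMBINING GRAVE ACCENT */
};

static const int UnicodeDecompMainLength =
	sizeof(UnicodeDecompMain) / sizeof(UnicodeDecompMain[0]);

/* Binary search on the sorted table: one lookup per input character. */
static const pg_unicode_decomposition *
get_decomposition(uint32_t cp)
{
	int			low = 0;
	int			high = UnicodeDecompMainLength - 1;

	while (low <= high)
	{
		int			mid = low + (high - low) / 2;
		uint32_t	key = UnicodeDecompMain[mid].codepoint;

		if (key == cp)
			return &UnicodeDecompMain[mid];
		else if (key < cp)
			low = mid + 1;
		else
			high = mid - 1;
	}
	return NULL;				/* no entry: the code point maps to itself */
}
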
FWIW, this patch is in a "waiting on author" state, and that's right. As the discussion on SASLprep and the decisions regarding the way to implement it, or at least to have it, are still pending, I am not planning to move on with any implementation until we have a plan about what to do. Just using libidn (LGPL) for a first shot would be rather painless, but... I am not alone here.
--
Michael