On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> I will continue tomorrow, but I wanted to report on what I've done so
> far. Attached is a new patch version, quite heavily modified. Notable
> changes so far:
Great, thanks!

> * Use Unicode codepoints, rather than UTF-8 bytes packed in 32-bit
> ints. IMHO this makes the tables easier to read (to a human), and they
> are also packed slightly more tightly (see next two points), as you
> can fit more codepoints in a 16-bit integer.

Using codepoints directly is not quite consistent with the existing
backend, but for the sake of packing things more tightly, OK.

> * Reorganize the decomposition table. Instead of a separate
> binary-searched table for each "size", have one table that's ordered
> by codepoint, which contains a length and an offset into another
> array, where all the decomposed sequences are. This makes the
> recomposition step somewhat slower, as it now has to grovel through an
> array that contains all decompositions, rather than just the array
> that contains decompositions of two characters. But this isn't
> performance-critical; readability and tight packing of the tables
> matter more.

Okay, no objection to that. (A sketch of the layout I am picturing is
at the end of this message, to check my understanding.)

> * In the lookup table, store singleton decompositions that decompose
> to a single character, if that character fits in a uint16, directly in
> the main lookup table, instead of storing an index into the other
> table. This makes the tables somewhat smaller, as there are a lot of
> such decompositions.

Indeed.

> * Use uint32 instead of pg_wchar to represent Unicode codepoints.
> pg_wchar suggests something that holds multi-byte characters in the OS
> locale, used by functions like wcscmp, so avoid that. I added a
> "typedef uint32 Codepoint" alias, though, to make it clearer which
> variables hold Unicode codepoints rather than e.g. indexes into
> arrays.
>
> * The Unicode consortium publishes a comprehensive normalization test
> suite, as a text file, NormalizationTest.txt. I wrote a small Perl and
> C program to run all the tests from that test suite against the
> normalization function. That uncovered a number of bugs in the
> recomposition algorithm, which I then fixed:

I had been looking for something like that for some time, thanks!
That's here, actually:
http://www.unicode.org/Public/8.0.0/ucd/NormalizationTest.txt

> * Decompositions that map to ASCII characters cannot be excluded.
> * Non-canonical and non-starter decompositions need to be excluded.
> They are now flagged in the lookup table, so that we only use them
> during the decomposition phase, not when recomposing.

Okay on that.

> TODOs / Discussion points:
>
> * I'm not sure I like the way the code is organized; I feel that we're
> littering src/common with these. Perhaps we should add a
> src/common/unicode_normalization directory for all this.

pg_utf8_islegal() and pg_utf_mblen() should be moved into their own
file as well, I think, and wchar.c can use that.

> * The list of characters excluded from recomposition is currently
> hard-coded in utf_norm_generate.pl. However, that list is available in
> machine-readable format, in the file CompositionExclusions.txt. Since
> we're reading most of the data from UnicodeData.txt, it would be good
> to read the exclusion table from a file, too.

Ouch. Those are listed here:
http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
It definitely makes more sense to read them from a file.

> * SASLPrep specifies normalization form KC, but it also specifies that
> some characters are mapped to space or nothing. Should do those
> mappings, too.

Ah, right. Those ones are listed here:
https://tools.ietf.org/html/rfc3454#appendix-B.1
The conversion tables will need some extra tweaking for that; a rough
sketch of the mapping pass is below as well.
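Coming back to the reorganized decomposition table and the inlined
singletons: here is roughly the shape I am picturing, just to check
that I follow. Untested, and all the names below (the struct fields,
UnicodeDecompMain, get_code_entry) are mine, not necessarily what the
patch uses:

    /*
     * One entry per codepoint that has a decomposition, ordered by
     * codepoint.  UnicodeDecompMain[] would be the generated,
     * codepoint-sorted table.
     */
    typedef struct
    {
        uint32      codepoint;      /* Unicode codepoint */
        uint8       dec_size_flags; /* size of the decomposition, plus
                                     * flags marking non-canonical and
                                     * non-starter decompositions that are
                                     * excluded from recomposition */
        uint16      dec_index;      /* offset into the shared array of
                                     * decomposed sequences, or the
                                     * decomposed codepoint itself for a
                                     * singleton that fits in 16 bits */
    } pg_unicode_decomposition;

    /* Binary search on codepoint, as the table is sorted on it. */
    static const pg_unicode_decomposition *
    get_code_entry(uint32 code)
    {
        int         min = 0;
        int         max = lengthof(UnicodeDecompMain) - 1;

        while (max >= min)
        {
            int         mid = (min + max) / 2;

            if (code > UnicodeDecompMain[mid].codepoint)
                min = mid + 1;
            else if (code < UnicodeDecompMain[mid].codepoint)
                max = mid - 1;
            else
                return &UnicodeDecompMain[mid];
        }
        return NULL;
    }

If that matches what you have, fine by me.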
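And about the SASLPrep mappings to space or nothing: I would expect an
extra filtering pass before the NFKC step, something like the sketch
below. The helper names (saslprep_map, is_mapped_to_nothing,
is_mapped_to_space) are hypothetical; they would be lookups into tables
generated from the RFC appendixes:

    /*
     * Sketch of the extra SASLPrep mapping pass: drop characters mapped
     * to nothing (RFC 3454, appendix B.1) and turn non-ASCII space
     * characters into U+0020 (appendix C.1.2), before running NFKC.
     * Returns the number of codepoints written to output.
     */
    static int
    saslprep_map(const uint32 *input, int input_len, uint32 *output)
    {
        int         out_len = 0;
        int         i;

        for (i = 0; i < input_len; i++)
        {
            if (is_mapped_to_nothing(input[i]))
                continue;           /* character is simply dropped */
            if (is_mapped_to_space(input[i]))
                output[out_len++] = 0x0020;
            else
                output[out_len++] = input[i];
        }
        return out_len;
    }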
One more comment, about the decoding code in the patch:

+    else if ((*utf & 0xf0) == 0xe0)
+    {
+        if (utf[1] == 0 || utf[2] == 0)
+            goto invalid;
+        cp = (*utf++) & 0x0F;
+        cp = (cp << 6) | ((*utf++) & 0x3F);
+        cp = (cp << 6) | ((*utf++) & 0x3F);
+    }
+    else if ((*utf & 0xf8) == 0xf0)
+    {
+        if (utf[1] == 0 || utf[2] == 0 || utf[3] == 0)
+            goto invalid;
+        cp = (*utf++) & 0x07;
+        cp = (cp << 6) | ((*utf++) & 0x3F);
+        cp = (cp << 6) | ((*utf++) & 0x3F);
+        cp = (cp << 6) | ((*utf++) & 0x3F);
+    }

I find something in the spirit of pg_utf2wchar_with_len() more
readable, but well... (A sketch of what I mean is at the end of this
message.)

Some typos in the patch:
s/fore/for/
s/reprsented/represented/

You seem to be fully on top of this... I can help out with any of
those items as needed.
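To be clearer about the pg_utf2wchar_with_len() remark: I mean checking
the remaining length instead of looking for NUL bytes, roughly as
below. Untested and only illustrative; the function name is mine:

    /*
     * Sketch only: length-checked decoding of one UTF-8 sequence, in
     * the spirit of pg_utf2wchar_with_len().  Returns the number of
     * bytes consumed, or -1 on a truncated or invalid sequence.
     */
    static int
    utf8_to_codepoint(const unsigned char *utf, int len, uint32 *cp)
    {
        if ((*utf & 0x80) == 0)
        {
            *cp = *utf;
            return 1;
        }
        else if ((*utf & 0xe0) == 0xc0)
        {
            if (len < 2)
                return -1;
            *cp = ((utf[0] & 0x1F) << 6) | (utf[1] & 0x3F);
            return 2;
        }
        else if ((*utf & 0xf0) == 0xe0)
        {
            if (len < 3)
                return -1;
            *cp = ((utf[0] & 0x0F) << 12) | ((utf[1] & 0x3F) << 6) |
                (utf[2] & 0x3F);
            return 3;
        }
        else if ((*utf & 0xf8) == 0xf0)
        {
            if (len < 4)
                return -1;
            *cp = ((utf[0] & 0x07) << 18) | ((utf[1] & 0x3F) << 12) |
                ((utf[2] & 0x3F) << 6) | (utf[3] & 0x3F);
            return 4;
        }
        return -1;
    }

--
Michael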