On Tue, Dec 20, 2016 at 10:47 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> And Heikki has mentioned to me that he'd prefer not having an extra dependency for the normalization, which is LGPL-licensed by the way. So I have looked at the SASLprep business to see what should be done to get a complete implementation in core, completely independent of any external library.
>
> The first thing is to be able to understand in the SCRAM code if a string is UTF-8 or not, and this code is in src/common/. pg_wchar.c offers a set of routines exactly for this purpose, but it is built with libpq and is not available to src/common/. So instead of moving the whole file, I'd like to create a new file, src/common/utf8.c, which includes pg_utf_mblen() and pg_utf8_islegal(). On top of that, I think that having a routine able to check a full string would be useful for many users, as pg_utf8_islegal() can only check one multibyte character at a time. If the password string is found to be in UTF-8 format, SASLprep is applied. If not, the string is copied as-is, with perhaps unexpected effects for the client, but the client is in trouble already if it is not using UTF-8.
>
> Then comes the real business... Note that this is my first time touching encoding, particularly UTF-8, in depth, so please be nice. I may write things that are incorrect or that sound so from here :)
>
> The second thing is the normalization itself. Per RFC 4013, NFKC needs to be applied to the string. The operation is described completely in [1], and it consists of 1) a compatibility decomposition of the characters of the string, followed by 2) a canonical composition.
>
> About 1). The compatibility decomposition is defined in [2]: "by recursively applying the canonical and compatibility mappings, then applying the canonical reordering algorithm". The canonical and compatibility mappings are data available in UnicodeData.txt, the 6th column of the set defined in [3] to be precise. The meaning of the decomposition mappings is defined in [2] as well. The canonical decomposition basically consists of looking up a given character and replacing it with the sequence of characters given by its mapping. The compatibility mapping should be applied as well, but [5], a perl tool called charlint.pl which does this normalization work, does not care about this phase... Do we?
>
> About 2)... Once the decomposition has been applied, those characters need to be recomposed using the Canonical_Combining_Class field of UnicodeData.txt in [3], which is the 3rd column of the set. Its values are defined in [4]. Another interesting thing: charlint.pl [5] does not care about this phase either. I am wondering if we should not just drop this part as well...
>
> Once 1) and 2) are done, NFKC is complete, and so is SASLprep.
>
> So what we need on the Postgres side is a mapping table with the following fields:
> 1) The hexadecimal sequence of the UTF-8 character.
> 2) Its canonical combining class.
> 3) The kind of decomposition mapping, if defined.
> 4) The decomposition mapping, in hexadecimal format.
> Based on what I looked at, either perl or python could be used to process UnicodeData.txt and to generate a header file that would be included in the tree. There are 30k entries in UnicodeData.txt, 5k of which have a mapping, so that will result in rather large tables.
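Coming back to the full-string check mentioned above, here is a rough sketch of what such a routine could look like on top of pg_utf_mblen() and pg_utf8_islegal() (prototypes as in pg_wchar.h); the wrapper name pg_utf8_string_islegal is made up for illustration:

#include <stdbool.h>

/* prototypes as in src/include/mb/pg_wchar.h */
extern int	pg_utf_mblen(const unsigned char *s);
extern bool pg_utf8_islegal(const unsigned char *source, int length);

/*
 * Check whether a whole string of "len" bytes is valid UTF-8, walking it
 * one multibyte character at a time.  (Name made up for illustration.)
 */
static bool
pg_utf8_string_islegal(const char *str, int len)
{
	const unsigned char *s = (const unsigned char *) str;
	const unsigned char *end = s + len;

	while (s < end)
	{
		int			mblen = pg_utf_mblen(s);

		/* reject a sequence that would run past the end of the string */
		if (s + mblen > end)
			return false;
		if (!pg_utf8_islegal(s, mblen))
			return false;
		s += mblen;
	}
	return true;
}
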
> One thing to improve performance would be to store the length of the table in a static variable, order the entries by their hexadecimal keys, and do a binary search to find an entry. We could as well use fancier things like a set of tables forming a radix tree keyed on the decomposed bytes. We should end up doing just one lookup in the table per character anyway.
>
> In conclusion, at this point I am looking for feedback regarding the following items:
> 1) Where to put the UTF-8 check routines and what to move.
> 2) How to generate the mapping table using UnicodeData.txt. I'd think that using perl would be better.
> 3) The shape of the mapping table, which depends on how many operations we want to support in the normalization of the strings.
> The decisions on those items will drive the implementation in one direction or another.
>
> [1]: http://www.unicode.org/reports/tr15/#Description_Norm
> [2]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Character_Decomposition_Mappings
> [3]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
> [4]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Canonical_Combining_Class_Values
> [5]: https://www.w3.org/International/charlint/
>
> Heikki, others, thoughts?
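To give a slightly more concrete idea of the shape the mapping table and its lookup could take, here is a minimal sketch, assuming a header generated from UnicodeData.txt; the structure, the table names and the two sample entries are only illustrative:

#include <stddef.h>
#include <stdint.h>

/* One entry per code point that has a combining class or a decomposition. */
typedef struct
{
	uint32_t	codepoint;		/* the code point itself */
	uint8_t		comb_class;		/* canonical combining class (3rd column) */
	uint8_t		decomp_size;	/* number of code points in the mapping */
	const uint32_t *decomp;		/* decomposition mapping (6th column), or NULL */
} pg_unicode_decomposition;

/*
 * Tiny hand-written excerpt for illustration only; the real table would be
 * generated from UnicodeData.txt and kept sorted by codepoint.
 */
static const uint32_t decomp_00C0[] = {0x0041, 0x0300};

static const pg_unicode_decomposition UnicodeDecompMain[] = {
	{0x00C0, 0, 2, decomp_00C0},	/* LATIN CAPITAL LETTER A WITH GRAVE */
	{0x0300, 230, 0, NULL}			/* COMBINING GRAVE ACCENT */
};

static const int UnicodeDecompMainLength =
	sizeof(UnicodeDecompMain) / sizeof(UnicodeDecompMain[0]);

/* Binary search on the sorted table: one lookup per input character. */
static const pg_unicode_decomposition *
get_decomposition(uint32_t cp)
{
	int			low = 0;
	int			high = UnicodeDecompMainLength - 1;

	while (low <= high)
	{
		int			mid = low + (high - low) / 2;
		uint32_t	key = UnicodeDecompMain[mid].codepoint;

		if (key == cp)
			return &UnicodeDecompMain[mid];
		else if (key < cp)
			low = mid + 1;
		else
			high = mid - 1;
	}
	return NULL;				/* no entry: the code point maps to itself */
}
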
FWIW, this patch is in a "waiting on author" state, and that's right. As the discussion on SASLprep and the decisions regarding the way to implement it, or at least to have it, are still pending, I am not planning to move on with any implementation until we have a plan about what to do. Just using libidn (LGPL) for a first shot would be rather painless, but... I am not alone here.
--
Michael