On Fri, Sep 04, 2015 at 03:17:31PM +1000, Damien Miller wrote:
> +/* Check whether we can display UTF-8 safely */
> +static int
> +utf8_ok(void)
> +{
> +     static int ret = -1;
> +     char *cp;
> +
> +     if (ret == -1) {
> +             setlocale(LC_CTYPE, "");

As discussed in the other post, you don't need to call setlocale() here.
Instead, call setlocale() just once during program startup, and check
if the current locale charset is UTF-8, as done here:

> +             cp = nl_langinfo(CODESET);
> +             ret = strcmp(cp, "UTF-8") == 0;

As a minor optimization, this function could check a global flag such
as locale_codeset_is_utf8 which is initialized at program startup.
You can safely assume the locale won't change unless your program
calls setlocale().
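Roughly what I mean, as a sketch (locale_init and locale_codeset_is_utf8 are
made-up names, not part of the patch):

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical global flag, initialized once at program startup. */
static int locale_codeset_is_utf8;

static void
locale_init(void)
{
	/* Adopt the locale from the environment, once. */
	setlocale(LC_CTYPE, "");
	locale_codeset_is_utf8 =
	    strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Then utf8_ok() reduces to reading the flag, and no code path ever calls
setlocale() after startup.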

> +/* Unassigned characters in Unicode 6.2 */
> +static const struct u32_range unassigned[] = {

You could ask afresh1@ to generate such lists for you from Perl.
See src/share/locale/ctype/gen_ctype_utf8.pl

> +static u_int32_t
> +decode_utf8(const char *in, const char **nextc, int *had_error)
> +{

Please make sure this function performs the same validation checks
as src/lib/libc/citrus/citrus_utf8.c:_citrus_utf8_ctype_mbrtowc() 

And if that libc function is missing checks that you perform here, we
should discuss aligning the two.
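For reference, this is roughly the set of checks I'd expect a strict
decoder to make, sketched from scratch (decode_utf8_strict is an
illustrative name, not the patch's function): reject malformed lead and
continuation bytes, overlong forms, surrogates, and code points past
U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence strictly.  Returns the code point and
 * advances *nextc, or returns (uint32_t)-1 on invalid input.
 */
static uint32_t
decode_utf8_strict(const char *in, const char **nextc)
{
	const unsigned char *s = (const unsigned char *)in;
	uint32_t c;
	int i, len;

	if (s[0] < 0x80) {
		c = s[0];
		len = 1;
	} else if ((s[0] & 0xe0) == 0xc0) {
		c = s[0] & 0x1f;
		len = 2;
	} else if ((s[0] & 0xf0) == 0xe0) {
		c = s[0] & 0x0f;
		len = 3;
	} else if ((s[0] & 0xf8) == 0xf0) {
		c = s[0] & 0x07;
		len = 4;
	} else
		return (uint32_t)-1;	/* bad lead byte */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return (uint32_t)-1;	/* bad continuation */
		c = (c << 6) | (s[i] & 0x3f);
	}
	/* Reject overlong forms: each length has a minimum code point. */
	if ((len == 2 && c < 0x80) || (len == 3 && c < 0x800) ||
	    (len == 4 && c < 0x10000))
		return (uint32_t)-1;
	/* Reject surrogates and out-of-range code points. */
	if ((c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff)
		return (uint32_t)-1;
	if (nextc != NULL)
		*nextc = in + len;
	return c;
}
```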

> + * Attempt to encode a UCS character as a UTF-8 sequence. Returns the number
> + * of characters used or -1 on error (insufficient space or bad code).
> + */
> +static int
> +encode_utf8(u_int32_t c, char *s, size_t slen)
> +{

And we should cross-check this function with _citrus_utf8_ctype_wcrtomb()
and improve whichever of the two is less robust.
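The counterpart check would look something like this sketch (again my own
illustrative code, not the patch's encode_utf8): refuse surrogates and
out-of-range values, always emit the shortest form, and fail cleanly when
the buffer is too small.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode one code point as shortest-form UTF-8.  Returns the number of
 * bytes written, or -1 if slen is too small or c is not a valid Unicode
 * scalar value (surrogate or > U+10FFFF).
 */
static int
encode_utf8_sketch(uint32_t c, char *s, size_t slen)
{
	if ((c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff)
		return -1;
	if (c < 0x80) {
		if (slen < 1)
			return -1;
		s[0] = (char)c;
		return 1;
	}
	if (c < 0x800) {
		if (slen < 2)
			return -1;
		s[0] = (char)(0xc0 | (c >> 6));
		s[1] = (char)(0x80 | (c & 0x3f));
		return 2;
	}
	if (c < 0x10000) {
		if (slen < 3)
			return -1;
		s[0] = (char)(0xe0 | (c >> 12));
		s[1] = (char)(0x80 | ((c >> 6) & 0x3f));
		s[2] = (char)(0x80 | (c & 0x3f));
		return 3;
	}
	if (slen < 4)
		return -1;
	s[0] = (char)(0xf0 | (c >> 18));
	s[1] = (char)(0x80 | ((c >> 12) & 0x3f));
	s[2] = (char)(0x80 | ((c >> 6) & 0x3f));
	s[3] = (char)(0x80 | (c & 0x3f));
	return 4;
}
```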

> +/*
> + * Normalise a UTF-8 string using the RFC3454 stringprep algorithm.
> + * Returns 0 on success or -1 on failure (prohibited code or insufficient
> + * length in the output string.
> + * Requires an output buffer at most the same length as the input.
> + */
> +int
> +utf8_stringprep(const char *in, char *out, size_t olen)
> +{
> +     int r;
> +     size_t o;
> +     u_int32_t c;
> +
> +     if (olen < 1)
> +             return -1;
> +
> +     for (o = 0; (c = decode_utf8(in, &in, NULL)) != 0;) {

At this point, we know the locale is fine with us outputting UTF-8.
This does not imply that the input is in fact valid UTF-8.

I can't see where you're checking for overlong UTF-8 sequences, for example.
We should never treat overlong UTF-8 sequences as valid input.
Why pass NULL for had_error? Why not check for invalid UTF-8 here?

> +             /* Mapping */
> +             if (code_in_table(c, map_to_nothing, sizeof(map_to_nothing)))
> +                     continue;
> +
> +             /* Prohibitied output */
> +             if (code_in_table(c, prohibited, sizeof(prohibited)) &&
> +                 !code_in_table(c, whitelist, sizeof(whitelist)))
> +                     return -1;
> +
> +             /* Map unassigned code points to U+FFFD */
> +             if (code_in_table(c, unassigned, sizeof(unassigned)))
> +                     c = 0xFFFD;
> +
> +             /* Encode the character */
> +             r = encode_utf8(c, out + o, olen - o - 1);

Here we are re-encoding the code point. Given the lack of protection
against overlong UTF-8 sequences, such sequences are now translated
to their valid shortest form. For example, the overlong input sequence
C0 80 decodes to code point zero and would trick us into storing a '\0'
(the valid UTF-8 encoding of code point zero) in the output.
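To make the point concrete, here's a toy two-byte decode (110xxxxx
10xxxxxx) with no overlong check, the way a naive decoder would do it;
it maps C0 80 straight to code point zero:

```c
#include <stddef.h>

/* Naive two-byte UTF-8 decode, deliberately missing the overlong check. */
static unsigned int
decode2_no_overlong_check(const unsigned char *s)
{
	return ((s[0] & 0x1fu) << 6) | (s[1] & 0x3fu);
}
```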

> +             if (r < 0)
> +                     return -1;
> +             o += r;
> +     }
> +     out[o] = '\0';
> +     return 0;
> +}
