Re: UTF-8 string filtering

Sebastien Marie Sat, 05 Sep 2015 08:53:47 -0700

About the approach, I see one possible drawback: with this API we can't
work on a partial string, and we have to hold the whole string in
memory. Depending on the usage, that could be a problem (for large block
processing, for example).


On Fri, Sep 04, 2015 at 03:17:31PM +1000, Damien Miller wrote:
> 
> +/*
> + * Attempt to encode a UCS character as a UTF-8 sequence. Returns the number
> + * of characters used or -1 on error (insufficient space or bad code).
> + */
> +static int
> +encode_utf8(u_int32_t c, char *s, size_t slen)
> +{
> +     size_t i, need;
> +     u_char h;
> +
> +     if (c < 0x80) {
> +             if (slen >= 1) {
> +                     s[0] = (char)c;
> +             }
> +             return 1;

I think an error (-1) should be returned if slen < 1. As written, the
function returns 1 even though nothing was stored in s.

> +     } else if (c < 0x800) {
> +             need = 2;
> +             h = 0xc0;
> +     } else if (c < 0x10000) {
> +             need = 3;
> +             h = 0xe0;
> +     } else if (c < 0x200000) {

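Shouldn't this be <= 0x10FFFF instead of < 0x200000? The comment in the
else branch says code points > U+10FFFF are invalid, but this test still
accepts everything up to 0x1FFFFF.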

> +             need = 4;
> +             h = 0xf0;
> +     } else {
> +             /* Invalid code point > U+10FFFF */
> +             return -1;
> +     }
> +     if (need > slen)
> +             return -1;
> +     for (i = 0; i < need; i++) {
> +             s[i] = (i == 0 ? h : 0x80);
> +             s[i] |= (c >> (need - i - 1) * 6) & 0x3f;
> +     }
> +     return need;
> +}
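
To make the two remarks above concrete, here is a rough sketch of how the
function could look with both changes applied. It is only an illustration,
not a replacement patch: the name encode_utf8_strict and the small example
function are made up here, and rejecting surrogates (U+D800..U+DFFF) is an
extra assumption on my side (RFC 3629 forbids them) that was not in the
original diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch only: encode_utf8_strict is a hypothetical name, not in the patch.
 * Returns the number of bytes written, or -1 on error (insufficient
 * space, surrogate, or code point above U+10FFFF).
 */
static int
encode_utf8_strict(uint32_t c, char *s, size_t slen)
{
	size_t i, need;
	unsigned char h;

	if (c < 0x80) {
		need = 1;
		h = 0x00;
	} else if (c < 0x800) {
		need = 2;
		h = 0xc0;
	} else if (c < 0x10000) {
		/* Assumption: surrogates are rejected as well. */
		if (c >= 0xd800 && c <= 0xdfff)
			return -1;
		need = 3;
		h = 0xe0;
	} else if (c <= 0x10ffff) {
		need = 4;
		h = 0xf0;
	} else {
		/* Invalid code point > U+10FFFF */
		return -1;
	}
	if (need > slen)
		return -1;	/* also covers slen < 1 for ASCII */
	/* Lead byte: length prefix plus the top payload bits. */
	s[0] = (char)(h | (c >> ((need - 1) * 6)));
	/* Continuation bytes: 10xxxxxx, six payload bits each. */
	for (i = 1; i < need; i++)
		s[i] = (char)(0x80 | ((c >> ((need - i - 1) * 6)) & 0x3f));
	return (int)need;
}

/* Hypothetical usage example for the two boundary cases discussed. */
static void
encode_utf8_strict_example(void)
{
	char buf[4];

	assert(encode_utf8_strict('A', buf, 0) == -1);
	assert(encode_utf8_strict(0x10ffff, buf, sizeof(buf)) == 4);
	assert(memcmp(buf, "\xf4\x8f\xbf\xbf", 4) == 0);
	assert(encode_utf8_strict(0x110000, buf, sizeof(buf)) == -1);
}
```

Routing the ASCII case through the common "need > slen" check means there
is only one place where the buffer size is tested, so the slen < 1 case
cannot be missed.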

-- 
Sebastien Marie
