Re: [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()

Markus Armbruster Fri, 22 Mar 2013 02:23:58 -0700

Laszlo Ersek <ler...@redhat.com> writes:

> I've (re-)read the UTF-8 article in wikipedia now... comments below
>
> On 03/14/13 18:49, Markus Armbruster wrote:
>
>> diff --git a/util/unicode.c b/util/unicode.c
>> new file mode 100644
>> index 0000000..954bcf7
>
>> +/**
>> + * mod_utf8_codepoint:
>> + * @s: string encoded in modified UTF-8
>> + * @n: maximum number of bytes to read from @s, if less than 6
>> + * @end: set to end of sequence on return
>> + *
>> + * Convert the modified UTF-8 sequence at the start of @s.  Modified
>> + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
>> + * "\xC0\x80".
>> + *
>> + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
>> + * byte, the sequence is invalid, and @end is set to @s + 1
>> + *
>> + * Else, the first byte determines how many continuation bytes are
>> + * expected.  If there are fewer, the sequence is invalid, and @end is
>> + * set to @s + 1 + actual number of continuation bytes.  Else, the
>> + * sequence is well-formed, and @end is set to @s + 1 + expected
>> + * number of continuation bytes.
>
> The wording also covers (number of cont. bytes == 0), OK.
>
> ... One point that wasn't clear to me until I read the code too is that
> "there are fewer" covers both "out of bytes" and "byte available, but
> it's not a cont. byte". Subtle :)


Guilty as charged :)

> The way "*end" is set ensures progress.

Yes.

>> + *
>> + * A well-formed sequence is valid unless it encodes a code point
>> + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
>> + * noncharacters, a surrogate code point, or is overlong.  Except the
>> + * overlong sequence "\xC0\x80" is valid.
>> + *
>> + * Conversion succeeds if and only if the sequence is valid.
>> + *
>> + * Returns: the Unicode code point on success, -1 on failure.
>> + */
>> +int mod_utf8_codepoint(const char *s, size_t n, char **end)
>
> The type of "end" follows the strtol() prototype. I guess I'd prefer
> "const char **end", and "unsigned char" all around actually. (The input
> is binary garbage.) Anyway this is irrelevant. ("const char **end" would
> be probably quite inconvenient for the caller,
> <http://www.c-faq.com/ansi/constmismatch.html>.)

C's type system is far too inexpressive to do const properly.  Best we
could do within the constraints, I guess.  Results in tension between
"declare unchanging things const" (to help compilers as well as human
readers) and "avoid type casts" (because they defeat the type checker
and reduce legibility).

I prefer to err on the side of avoiding casts.  In particular, I try to
make my interfaces so that they can be used without casts as much as
possible.

Thus, I made the first argument const, but not the last.  I guess
similar thinking prevailed when the library part of C was constified;
see strtol() precedence you quoted.

>> +{
>> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
>> +    const unsigned char *p;
>> +    unsigned byte, mask, len, i;
>> +    int cp;
>> +
>> +    if (n == 0) {
>> +        *end = (char *)s;
>> +        return 0;
>> +    }
>
> This is a special case (we return the code point U+0000 after looking at
> zero bytes); we can probably expect the caller to know about n==0.

We could make it an error instead.  What's your gut feeling?

>> +
>> +    p = (const unsigned char *)s;
>> +    byte = *p++;
>
> We have n>=1 bytes here, and pointing one past the last one (in case
> n==1) is allowed.
>
>> +    if (byte < 0x80) {
>> +        cp = byte;              /* one byte sequence */
>
> So, since this is modified UTF-8, the U+0000 code point is represented
> with the overlong "\xC0\x80" sequence, and a bare '\0' can never be part
> of the Modified UTF-8 byte stream. (According to wp, '\0' is used in C
> and similar to zero-terminate the string, but I think that's outside the
> scope of the wire format.)
>
> If we find a '\0' here, we report it as code point U+0000 instead of
> rejecting it. One could maybe argue that it's a violation of the
> interface contract (namely, due to '\0' being at offset #0, the caller
> should have passed in n==0), but I assume from the description that the
> caller need not have any idea about the contents, knowing the size
> should be enough.
>
> IOW I'm not sure about the intended use of this function, but if the
> caller is allowed to throw any binary data at it (with correctly
> communicated size of course), then we can misreport U+0000 here. (*)

Good point.

'\0' should be treated as string terminator, exactly like reaching @n.

Let's check what my current code does:

    n == 0:
        set @end to @s, and return 0

    n > 0 && *s == 0:
        set @end to @s + 1, and return 0

Conclusion: you caught a bug.  Thanks, will fix!

>> +    } else if (byte >= 0xFE) {
>> +        cp = -1;                /* impossible bytes 0xFE, 0xFF */
>
> OK, binary 1111111x is as first byte misformed.

Correct.

>> +    } else if ((byte & 0x40) == 0) {
>> +        cp = -1;                /* unexpected continuation byte */
>
> We know here that byte >= 0x80, and continuation bytes look like
> 10xxxxxx, OK.

Correct.

>> +    } else {
>> +        /* multi-byte sequence */
>
> We have one of:
>
> 110xxxxx
> 1110xxxx
> 11110xxx
> 111110xx
> 1111110x

Correct.

>> +        len = 0;
>> +        for (mask = 0x80; byte & mask; mask >>= 1) {
>> +            len++;
>> +        }
>> +        assert(len > 1 && len < 7);
>
> OK, [2..6].
>
> (Maybe use g_assert()? :))

Now you're teasing me!

>> +        cp = byte & (mask - 1);
>
> OK, the only bit set in "mask" matches the leftmost clear bit in "byte".
> (mask-1) selects the xxxx bits in "byte".

Correct.

>> +        for (i = 1; i < len; i++) {
>
> Runs at least once, and as many times as we need continuation bytes. OK.

Correct.

>> +            byte = i < n ? *p : 0;
>
> "p" is valid to evaluate, "*p" may not be, so we check first. OK.
>
>> +            if ((byte & 0xC0) != 0x80) {
>> +                cp = -1;        /* continuation byte missing */
>> +                goto out;
>> +            }
>
> Right, if a byte is available, it must look 10xxxxxx (binary). If we're
> out of bytes, we also take this branch. *end will be left one past the
> actual cont. bytes.

Correct.

>> +            p++;
>> +            cp <<= 6;
>> +            cp |= byte & 0x3F;
>> +        }
>
> We consume the cont. byte: p++ is valid to evaluate, and we shift in the
> low six bits from the cont byte into the codepoint.
>
> The loop runs at most 5 times (for len==6), shifting in (len-1)*6 bits
> at most, in addition to the initial cp assignment. The initial cp
> assignment clusters the masked-in bits at the LSB end:
>
>     mask == (0x80 >> len) - 1
>
> hence the number of masked-in bits in the original cp assignment is
> 7-len (which difference falls into [1..5]). So the total number of bits
> shifted into "cp" is
>
>     7-len + (len-1)*6 == 1 + 5 * len
>
> Since len <= 6, the above is <= 31, which is perfect for our "int"
> (int32_t in practice).

Exactly!

>> +        if (cp > 0x10FFFF) {
>> +            cp = -1;            /* beyond Unicode range */
>
> OK.
>
>> +        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
>> +                   || (cp & 0xFFFE) == 0xFFFE) {
>> +            cp = -1;            /* noncharacter */
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
>
> Interesting. OK.
>
>> +        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
>> +            cp = -1;            /* surrogate code point */
>
> OK.
>
>> +        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
>> +            cp = -1;            /* overlong, not \xC0\x80 */
>> +        }
>> +    }
>> +
>> +out:
>> +    *end = (char *)p;
>> +    return cp;
>> +}
>> 
>
> I think if we want to adhere to Modified UTF-8 strictly, then (*) should
> be fixed. If not, I can give my Rb; this is nice code.

I think it needs fixing.

Thanks for the thorough review!

Re: [Qemu-devel] [PATCH 1/4] unicode: New mod_utf8_codepoint()

Reply via email to