Laszlo Ersek <ler...@redhat.com> writes: > I've (re-)read the UTF-8 article in wikipedia now... comments below > > On 03/14/13 18:49, Markus Armbruster wrote: > >> diff --git a/util/unicode.c b/util/unicode.c >> new file mode 100644 >> index 0000000..954bcf7 > >> +/** >> + * mod_utf8_codepoint: >> + * @s: string encoded in modified UTF-8 >> + * @n: maximum number of bytes to read from @s, if less than 6 >> + * @end: set to end of sequence on return >> + * >> + * Convert the modified UTF-8 sequence at the start of @s. Modified >> + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as >> + * "\xC0\x80". >> + * >> + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation >> + * byte, the sequence is invalid, and @end is set to @s + 1 >> + * >> + * Else, the first byte determines how many continuation bytes are >> + * expected. If there are fewer, the sequence is invalid, and @end is >> + * set to @s + 1 + actual number of continuation bytes. Else, the >> + * sequence is well-formed, and @end is set to @s + 1 + expected >> + * number of continuation bytes. > > The wording also covers (number of cont. bytes == 0), OK. > > ... One point that wasn't clear to me until I read the code too is that > "there are fewer" covers both "out of bytes" and "byte available, but > it's not a cont. byte". Subtle :)
Guilty as charged :) > The way "*end" is set ensures progress. Yes. >> + * >> + * A well-formed sequence is valid unless it encodes a code point >> + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66 >> + * noncharacters, a surrogate code point, or is overlong. Except the >> + * overlong sequence "\xC0\x80" is valid. >> + * >> + * Conversion succeeds if and only if the sequence is valid. >> + * >> + * Returns: the Unicode code point on success, -1 on failure. >> + */ >> +int mod_utf8_codepoint(const char *s, size_t n, char **end) > > The type of "end" follows the strtol() prototype. I guess I'd prefer > "const char **end", and "unsigned char" all around actually. (The input > is binary garbage.) Anyway this is irrelevant. ("const char **end" would > be probably quite inconvenient for the caller, > <http://www.c-faq.com/ansi/constmismatch.html>.) C's type system is far too inexpressive to do const properly. Best we could do within the constraints, I guess. Results in tension between "declare unchanging things const" (to help compilers as well as human readers) and "avoid type casts" (because they defeat the type checker and reduce legibility). I prefer to err on the side of avoiding casts. In particular, I try to make my interfaces so that they can be used without casts as much as possible. Thus, I made the first argument const, but not the last. I guess similar thinking prevailed when the library part of C was constified; see strtol() precedence you quoted. >> +{ >> + static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 }; >> + const unsigned char *p; >> + unsigned byte, mask, len, i; >> + int cp; >> + >> + if (n == 0) { >> + *end = (char *)s; >> + return 0; >> + } > > This is a special case (we return the code point U+0000 after looking at > zero bytes); we can probably expect the caller to know about n==0. We could make it an error instead. What's your gut feeling? >> + >> + p = (const unsigned char *)s; >> + byte = *p++; > > We have n>=1 bytes here, and pointing one past the last one (in case > n==1) is allowed. > >> + if (byte < 0x80) { >> + cp = byte; /* one byte sequence */ > > So, since this is modified UTF-8, the U+0000 code point is represented > with the overlong "\xC0\x80" sequence, and a bare '\0' can never be part > of the Modified UTF-8 byte stream. (According to wp, '\0' is used in C > and similar to zero-terminate the string, but I think that's outside the > scope of the wire format.) > > If we find a '\0' here, we report it as code point U+0000 instead of > rejecting it. One could maybe argue that it's a violation of the > interface contract (namely, due to '\0' being at offset #0, the caller > should have passed in n==0), but I assume from the description that the > caller need not have any idea about the contents, knowing the size > should be enough. > > IOW I'm not sure about the intended use of this function, but if the > caller is allowed to throw any binary data at it (with correctly > communicated size of course), then we can misreport U+0000 here. (*) Good point. '\0' should be treated as string terminator, exactly like reaching @n. Let's check what my current code does: n == 0: set @end to @s, and return 0 n > 0 && *s == 0: set @end to @s + 1, and return 0 Conclusion: you caught a bug. Thanks, will fix! >> + } else if (byte >= 0xFE) { >> + cp = -1; /* impossible bytes 0xFE, 0xFF */ > > OK, binary 1111111x is as first byte misformed. Correct. >> + } else if ((byte & 0x40) == 0) { >> + cp = -1; /* unexpected continuation byte */ > > We know here that byte >= 0x80, and continuation bytes look like > 10xxxxxx, OK. Correct. >> + } else { >> + /* multi-byte sequence */ > > We have one of: > > 110xxxxx > 1110xxxx > 11110xxx > 111110xx > 1111110x Correct. >> + len = 0; >> + for (mask = 0x80; byte & mask; mask >>= 1) { >> + len++; >> + } >> + assert(len > 1 && len < 7); > > OK, [2..6]. > > (Maybe use g_assert()? :)) Now you're teasing me! >> + cp = byte & (mask - 1); > > OK, the only bit set in "mask" matches the leftmost clear bit in "byte". > (mask-1) selects the xxxx bits in "byte". Correct. >> + for (i = 1; i < len; i++) { > > Runs at least once, and as many times as we need continuation bytes. OK. Correct. >> + byte = i < n ? *p : 0; > > "p" is valid to evaluate, "*p" may not be, so we check first. OK. > >> + if ((byte & 0xC0) != 0x80) { >> + cp = -1; /* continuation byte missing */ >> + goto out; >> + } > > Right, if a byte is available, it must look 10xxxxxx (binary). If we're > out of bytes, we also take this branch. *end will be left one past the > actual cont. bytes. Correct. >> + p++; >> + cp <<= 6; >> + cp |= byte & 0x3F; >> + } > > We consume the cont. byte: p++ is valid to evaluate, and we shift in the > low six bits from the cont byte into the codepoint. > > The loop runs at most 5 times (for len==6), shifting in (len-1)*6 bits > at most, in addition to the initial cp assignment. The initial cp > assignment clusters the masked-in bits at the LSB end: > > mask == (0x80 >> len) - 1 > > hence the number of masked-in bits in the original cp assignment is > 7-len (which difference falls into [1..5]). So the total number of bits > shifted into "cp" is > > 7-len + (len-1)*6 == 1 + 5 * len > > Since len <= 6, the above is <= 31, which is perfect for our "int" > (int32_t in practice). Exactly! >> + if (cp > 0x10FFFF) { >> + cp = -1; /* beyond Unicode range */ > > OK. > >> + } else if ((cp >= 0xFDD0 && cp <= 0xFDEF) >> + || (cp & 0xFFFE) == 0xFFFE) { >> + cp = -1; /* noncharacter */ > > http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters > > Interesting. OK. > >> + } else if (cp >= 0xD800 && cp <= 0xDFFF) { >> + cp = -1; /* surrogate code point */ > > OK. > >> + } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) { >> + cp = -1; /* overlong, not \xC0\x80 */ >> + } >> + } >> + >> +out: >> + *end = (char *)p; >> + return cp; >> +} >> > > I think if we want to adhere to Modified UTF-8 strictly, then (*) should > be fixed. If not, I can give my Rb; this is nice code. I think it needs fixing. Thanks for the thorough review!