Eric Blake <ebl...@redhat.com> writes:

> On 08/08/2018 07:02 AM, Markus Armbruster wrote:
>> We reject bytes that can't occur in valid UTF-8 (\xC0..\xC1,
>> \xF5..\xFF) in the lexer. That's insufficient; there's plenty of
>> invalid UTF-8 not containing these bytes, as demonstrated by
>> check-qjson:
>>
>> * Malformed sequences
>>
>>   - Unexpected continuation bytes
>>
>>   - Missing continuation bytes after start bytes other than
>>     \xC0..\xC1, \xF5..\xFD.
>>
>> * Overlong sequences with start bytes other than \xC0..\xC1,
>>   \xF5..\xFD.
>>
>> * Invalid code points
>>
>> Fixing this in the lexer would be bothersome. Fixing it in the
>> parser is straightforward, so do that.
>>
>> Signed-off-by: Markus Armbruster <arm...@redhat.com>
>> ---
>
>> @@ -193,12 +198,15 @@ static QString *qstring_from_escaped_str(JSONParserContext *ctxt,
>>                  goto out;
>>              }
>>          } else {
>> -            char dummy[2];
>> -
>> -            dummy[0] = *ptr;
>> -            dummy[1] = 0;
>> -
>> -            qstring_append(str, dummy);
>> +            cp = mod_utf8_codepoint(ptr, 6, &end);
>
> Why are you hard-coding 6 here, rather than computing min(6,
> strchr(ptr,0)-ptr)? If the user passes an invalid sequence at the end
> of the string, can we end up making mod_utf8_codepoint() read beyond
> the end of our string? Would it be better to just always pass the
> remaining string length (mod_utf8_codepoint() only cares about
> stopping short of 6 bytes, but never reads beyond there even if you
> pass a larger number)?

mod_utf8_codepoint() never reads beyond '\0'. The second parameter
exists only so you can further limit reads. I like to provide that
capability, because it sometimes saves a silly substring copy.
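To illustrate, here's a minimal sketch of a caller leaning on that
guarantee (not part of the patch; validate_mod_utf8() is a made-up
name for the example):

    #include "qemu/osdep.h"
    #include "qemu/unicode.h"

    /*
     * Walk a string the way qstring_from_escaped_str() does, letting
     * mod_utf8_codepoint() find the end of each sequence by itself.
     */
    static bool validate_mod_utf8(const char *s)
    {
        char *end;
        int cp;

        while (*s) {
            /*
             * Passing 6 is safe even when fewer than 6 bytes remain:
             * the function stops at the terminating '\0'.
             */
            cp = mod_utf8_codepoint(s, 6, &end);
            if (cp <= 0) {
                return false;   /* invalid sequence (or encoded NUL) */
            }
            s = end;
        }
        return true;
    }

Computing min(6, remaining length) would behave exactly the same; it
merely duplicates work the terminating NUL already does for us.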
>> +            if (cp <= 0) {
>> +                parse_error(ctxt, token, "invalid UTF-8 sequence in string");
>> +                goto out;
>> +            }
>> +            ptr = end - 1;
>> +            len = mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp);
>> +            assert(len >= 0);
>> +            qstring_append(str, utf8_buf);
>>          }
>>      }
>>
>
>> +++ b/util/unicode.c
>> @@ -13,6 +13,21 @@
>>  #include "qemu/osdep.h"
>>  #include "qemu/unicode.h"
>>
>
>> +ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
>> +{
>> +    assert(bufsz >= 5);
>> +
>> +    if (!is_valid_codepoint(codepoint)) {
>> +        return -1;
>> +    }
>> +
>> +    if (codepoint > 0 && codepoint <= 0x7F) {
>> +        buf[0] = codepoint & 0x7F;
>
> Dead use of binary &. But acceptable for symmetry with the other code
> branches.

Exactly as dead as ...

>> +        buf[1] = 0;
>> +        return 1;
>> +    }
>> +    if (codepoint <= 0x7FF) {
>> +        buf[0] = 0xC0 | ((codepoint >> 6) & 0x1F);

... this one, and ...

>> +        buf[1] = 0x80 | (codepoint & 0x3F);
>> +        buf[2] = 0;
>> +        return 2;
>> +    }
>> +    if (codepoint <= 0xFFFF) {
>> +        buf[0] = 0xE0 | ((codepoint >> 12) & 0x0F);

... this one, and ...

>> +        buf[1] = 0x80 | ((codepoint >> 6) & 0x3F);
>> +        buf[2] = 0x80 | (codepoint & 0x3F);
>> +        buf[3] = 0;
>> +        return 3;
>> +    }
>> +    buf[0] = 0xF0 | ((codepoint >> 18) & 0x07);

... even this one. The last one only because is_valid_codepoint()
rejects codepoints > 0x10FFFFu, which is admittedly a non-local
argument.

I'm debating whether to keep or drop the redundant masking. Got a
preference? (A quick brute-force check that the masks really are dead
is appended at the end of this mail.)

>> +    buf[1] = 0x80 | ((codepoint >> 12) & 0x3F);
>> +    buf[2] = 0x80 | ((codepoint >> 6) & 0x3F);
>> +    buf[3] = 0x80 | (codepoint & 0x3F);
>> +    buf[4] = 0;
>> +    return 4;
>> +}
>>
>
> Overall, looks nice.

Thanks!
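Appendix, since the question is on the table: a throwaway brute-force
check (standalone C, not meant for the tree) that each questionable
mask is an identity on its branch. is_valid() is a simplified
stand-in for is_valid_codepoint(); the real one also rejects
noncharacters, which only shrinks the range and can't bring a dead
mask back to life.

    #include <assert.h>

    /*
     * Simplified stand-in: reject surrogates and anything beyond
     * the Unicode range.
     */
    static int is_valid(int cp)
    {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }

    int main(void)
    {
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (!is_valid(cp)) {
                continue;
            }
            if (cp <= 0x7F) {
                assert((cp & 0x7F) == cp);
            } else if (cp <= 0x7FF) {
                assert(((cp >> 6) & 0x1F) == cp >> 6);
            } else if (cp <= 0xFFFF) {
                assert(((cp >> 12) & 0x0F) == cp >> 12);
            } else {
                assert(((cp >> 18) & 0x07) == cp >> 18);
            }
        }
        return 0;
    }

All asserts hold, the last one only because the loop stops at
0x10FFFF, which is the same non-local argument as above.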