On 08/08/2018 07:02 AM, Markus Armbruster wrote:
We reject bytes that can't occur in valid UTF-8 (\xC0..\xC1,
\xF5..\xFF in the lexer. That's insufficient; there's plenty of
invalid UTF-8 not containing these bytes, as demonstrated by
check-qjson:
* Malformed sequences
- Unexpected continuation bytes
- Missing continuation bytes after start bytes other than
\xC0..\xC1, \xF5..\xFD.
* Overlong sequences with start bytes other than \xC0..\xC1,
\xF5..\xFD.
* Invalid code points
Fixing this in the lexer would be bothersome. Fixing it in the parser
is straightforward, so do that.
Signed-off-by: Markus Armbruster <arm...@redhat.com>
---
@@ -193,12 +198,15 @@ static QString
*qstring_from_escaped_str(JSONParserContext *ctxt,
goto out;
}
} else {
- char dummy[2];
-
- dummy[0] = *ptr;
- dummy[1] = 0;
-
- qstring_append(str, dummy);
+ cp = mod_utf8_codepoint(ptr, 6, &end);
Why are you hard-coding 6 here, rather than computing min(6,
strchr(ptr,0)-ptr)? If the user passes an invalid sequence at the end
of the string, can we end up making mod_utf8_codepoint() read beyond the
end of our string? Would it be better to just always pass the remaining
string length (mod_utf8_codepoint() only cares about stopping short of 6
bytes, but never reads beyond there even if you pass a larger number)?
+ if (cp <= 0) {
+ parse_error(ctxt, token, "invalid UTF-8 sequence in string");
+ goto out;
+ }
+ ptr = end - 1;
+ len = mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp);
+ assert(len >= 0);
+ qstring_append(str, utf8_buf);
}
}
+++ b/util/unicode.c
@@ -13,6 +13,21 @@
#include "qemu/osdep.h"
#include "qemu/unicode.h"
+ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
+{
+ assert(bufsz >= 5);
+
+ if (!is_valid_codepoint(codepoint)) {
+ return -1;
+ }
+
+ if (codepoint > 0 && codepoint <= 0x7F) {
+ buf[0] = codepoint & 0x7F;
Dead use of binary &. But acceptable for symmetry with the other code
branches.
+ buf[1] = 0;
+ return 1;
+ }
+ if (codepoint <= 0x7FF) {
+ buf[0] = 0xC0 | ((codepoint >> 6) & 0x1F);
+ buf[1] = 0x80 | (codepoint & 0x3F);
+ buf[2] = 0;
+ return 2;
+ }
+ if (codepoint <= 0xFFFF) {
+ buf[0] = 0xE0 | ((codepoint >> 12) & 0x0F);
+ buf[1] = 0x80 | ((codepoint >> 6) & 0x3F);
+ buf[2] = 0x80 | (codepoint & 0x3F);
+ buf[3] = 0;
+ return 3;
+ }
+ buf[0] = 0xF0 | ((codepoint >> 18) & 0x07);
+ buf[1] = 0x80 | ((codepoint >> 12) & 0x3F);
+ buf[2] = 0x80 | ((codepoint >> 6) & 0x3F);
+ buf[3] = 0x80 | (codepoint & 0x3F);
+ buf[4] = 0;
+ return 4;
+}
Overall, looks nice.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org