Eric Blake <ebl...@redhat.com> writes:

> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>> The JSON parser treats each half of a surrogate pair as unpaired
>> surrogate. Fix it to recognize surrogate pairs.
>>
>> Signed-off-by: Markus Armbruster <arm...@redhat.com>
>> ---
>>  qobject/json-parser.c | 16 +++++++++++++++-
>>  tests/check-qjson.c   |  3 +--
>>  2 files changed, 16 insertions(+), 3 deletions(-)
>>
>
>> @@ -168,6 +170,18 @@ static QString *parse_string(JSONParserContext *ctxt, JSONToken *token)
>>                      cp |= hex2decimal(*ptr);
>>                  }
>> +                if (cp >= 0xD800 && cp <= 0xDBFF && !leading_surrogate
>> +                    && ptr[1] == '\\' && ptr[2] == 'u') {
>> +                    ptr += 2;
>> +                    leading_surrogate = cp;
>> +                    goto hex;
>> +                }
>> +                if (cp >= 0xDC00 && cp <= 0xDFFF && leading_surrogate) {
>> +                    cp &= 0x3FF;
>> +                    cp |= (leading_surrogate & 0x3FF) << 10;
>> +                    cp += 0x010000;
>> +                }
>> +
>>                  if (mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp) < 0) {
>>                      parse_error(ctxt, token,
>>                                  "\\u%.4s is not a valid Unicode character",
>
> Consider "\\udbff\\udfff" - a valid surrogate pair (in terms of being
> in range), but which decodes to u+10ffff. Since is_valid_codepoint()
> (part of mod_utf8_encode()) rejects it due to (codepoint & 0xfffe) ==
> 0xfffe, it means we end up printing this error message, but only using
> the second half of the surrogate pair. Is that okay?

It's not horrible, but I wouldn't call it okay. I'll try to improve it.

> Otherwise,
> Reviewed-by: Eric Blake <ebl...@redhat.com>

Thanks!
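
For reference, the arithmetic in the case Eric raises can be checked in isolation. The following is only a minimal sketch, not code from qobject/json-parser.c: the helper names (is_leading_surrogate, is_trailing_surrogate, combine_surrogates, is_plane_end_noncharacter) are hypothetical, and the last one reproduces only the single (codepoint & 0xfffe) == 0xfffe test quoted above, not the whole of is_valid_codepoint().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; none of these exist in QEMU. */

/* Leading (high) surrogates occupy U+D800..U+DBFF. */
static bool is_leading_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDBFF;
}

/* Trailing (low) surrogates occupy U+DC00..U+DFFF. */
static bool is_trailing_surrogate(uint32_t cp)
{
    return cp >= 0xDC00 && cp <= 0xDFFF;
}

/*
 * Combine a surrogate pair with the same arithmetic as the quoted hunk:
 * ten bits from each half, offset by 0x10000.
 * Returns -1 if the arguments do not form a leading/trailing pair.
 */
static int32_t combine_surrogates(uint32_t lead, uint32_t trail)
{
    if (!is_leading_surrogate(lead) || !is_trailing_surrogate(trail)) {
        return -1;
    }
    return (int32_t)((((lead & 0x3FF) << 10) | (trail & 0x3FF)) + 0x10000);
}

/*
 * Only the test quoted in the review: true for the last two code points
 * of every plane (U+FFFE, U+FFFF, ..., U+10FFFE, U+10FFFF).
 */
static bool is_plane_end_noncharacter(uint32_t cp)
{
    return (cp & 0xFFFE) == 0xFFFE;
}
```

With these, combine_surrogates(0xDBFF, 0xDFFF) yields 0x10FFFF, which is_plane_end_noncharacter() flags, so the pair is well formed yet still ends up in the error path. An ordinary pair such as 0xD83D, 0xDE00 yields 0x1F600 and passes the check.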