Eric Blake <ebl...@redhat.com> writes: > On 08/13/2018 02:00 AM, Markus Armbruster wrote: >> Eric Blake <ebl...@redhat.com> writes: >> >>> On 08/10/2018 10:48 AM, Eric Blake wrote: >>>> On 08/08/2018 07:03 AM, Markus Armbruster wrote: >>>>> This is consistent with qobject_to_json(). See commit e2ec3f97680. >>>> >>>> Side note: that commit mentions that on output, ASCII DEL (0x7f) is >>>> always escaped. RFC 7159 does not require it to be escaped on input, >> >> Weird, isn't it? >> >>>> but I wonder if any of your earlier testsuite improvements should >>>> specifically cover \x7f vs. \u007f on input being canonicalized to >>>> \u007f on round trip output. >> >>>From utf8_string(): >> >> /* 2.2.1 1 byte U+007F */ >> { >> "\x7F", >> "\x7F", >> "\\u007F", >> }, >> >> We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of >> that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of >> that JSON (expecting the C string again). Sufficient? > > Indeed, looks like we have coverage of DEL thanks to the bounds > testing of various interesting UTF-8 inflection points. > >>>>> +++ b/qobject/json-lexer.c >>>>> @@ -93,7 +93,7 @@ >>>>> * interpolation = %((l|ll|I64)[du]|[ipsf]) >>>>> * >>>>> * Note: >>>>> - * - Input must be encoded in UTF-8. >>>>> + * - Input must be encoded in modified UTF-8. >>>> >>>> Worth documenting this in the QMP doc as an explicit extension? >> >> qmp-spec.txt: >> >> The sever expects its input to be encoded in UTF-8, and sends its >> output encoded in ASCII. >> >> The obvious update would be to stick in "modified". >> >>>> In >>>> general, our QMP interfaces that take binary input do so via base64 >>>> encoding, rather than via a modified UTF-8 string - >> >> Agreed. >> >> However, whether QMP has a use for funny characters or not, the JSON >> parser has to handle them *somehow*. "Handle" in the broadest possible >> sense, including "reject". Not including misbehavior like "crash" and >> "silently ignore some input following ASCII NUL". >> > >> >>> So it's really >>> only a question of whether our input engine can pass "\x00" >>> vs. "\\u0000" when we NEED an input NUL, and except for the testsuite, >>> our QAPI schema never really needs an input NUL. >> >> The question is how the JSON parser is to handle "\u0000" escapes in >> JSON strings and NUL bytes anywhere. >> >> The answer for NUL bytes is obvious: reject them just like any other >> byte <= 0x1F, as required by the RFC. > > Yes, rejecting \x00 on input is fine, which leaves only the escape > inside strings: > >> >> My answer for \u0000 is to treat it as much like any other codepoint <= >> 0x1F as possible. Treating it exactly like them isn't possible, because >> NUL bytes in C strings aren't possible. However, \xC0\x80 sequences >> are. > > Yes, for internal processing to use modified UTF-8 so that we can > accept "\\u0000" on input is reasonable. And even nicer if we turn a > QString containing \xc0\x80 back into "\\u0000" on output, so that our > use of modified UTF-8 never even escapes to the QMP client (and then > we don't need a documentation change).
We do, in to_json(): for (; *ptr; ptr = end) { ---> cp = mod_utf8_codepoint(ptr, 6, &end); switch (cp) { [special cases like \n...] default: if (cp < 0) { cp = 0xFFFD; /* replacement character */ } if (cp > 0xFFFF) { /* beyond BMP; need a surrogate pair */ snprintf(buf, sizeof(buf), "\\u%04X\\u%04X", 0xD800 + ((cp - 0x10000) >> 10), 0xDC00 + ((cp - 0x10000) & 0x3FF)); } else if (cp < 0x20 || cp >= 0x7F) { ---> snprintf(buf, sizeof(buf), "\\u%04X", cp); } else { buf[0] = cp; buf[1] = 0; } qstring_append(str, buf); } mod_utf8_codepoint() recognizes \xC0\x80 as codepoint 0. That's an ASCII control character, and those gets emitted using format \\u%04X. >> >> Reasonably simple and consistent, don't you think? >>