Eric Blake <ebl...@redhat.com> writes:

> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>> Both lexer and parser reject invalid escape sequences in strings.  The
>> parser's check is useless.
>>
>
>>
>> Drop the lexer's escape sequence checking, and make it accept the same
>> characters after '\' it accepts elsewhere in strings.  It now produces
>>
>>     JSON_LCURLY   {
>>     JSON_STRING   "abc\@ijk"
>>     JSON_COLON    :
>>     JSON_INTEGER  1
>>     JSON_RCURLY
>>
>> and the parser reports just
>>
>>     JSON parse error, invalid escape sequence in string
>>
>> While there, fix parse_string()'s inaccurate function comment.
>
> Worthwhile improvement.
>
>>
>> Signed-off-by: Markus Armbruster <arm...@redhat.com>
>> ---
>>  qobject/json-lexer.c  | 72 +++----------------------------------------
>>  qobject/json-parser.c | 56 +++++++++++++++++++--------------
>>  2 files changed, 37 insertions(+), 91 deletions(-)
>
> and shorter!
>
>>      [IN_DQ_STRING_ESCAPE] = {
>> -        ['b'] = IN_DQ_STRING,
>> -        ['f'] = IN_DQ_STRING,
>> -        ['n'] = IN_DQ_STRING,
>> -        ['r'] = IN_DQ_STRING,
>> -        ['t'] = IN_DQ_STRING,
>> -        ['/'] = IN_DQ_STRING,
>> -        ['\\'] = IN_DQ_STRING,
>> -        ['\''] = IN_DQ_STRING,
>> -        ['\"'] = IN_DQ_STRING,
>> -        ['u'] = IN_DQ_UCODE0,
>> +        [0x20 ... 0xFD] = IN_DQ_STRING,
>
> Among other things, this means the parser now has to flag "\u" as an
> incomplete escape - but your added testsuite coverage earlier in the
> series ensures that we do.

Yes.

>> +++ b/qobject/json-parser.c
>> @@ -106,30 +106,40 @@ static int hex2decimal(char ch)
>>  }
>>  /**
>> - * parse_string(): Parse a json string and return a QObject
>> + * parse_string(): Parse a JSON string
>>   *
>> - *  string
>
>> + *  From RFC 7159 "The JavaScript Object Notation (JSON) Data
>> + *  Interchange Format":
>> + *
>> + *    char = unescaped /
>> + *        escape (
>> + *            %x22 /          ; "    quotation mark  U+0022
>> + *            %x5C /          ; \    reverse solidus U+005C
>> + *            %x2F /          ; /    solidus         U+002F
>> + *            %x62 /          ; b    backspace       U+0008
>> + *            %x66 /          ; f    form feed       U+000C
>> + *            %x6E /          ; n    line feed       U+000A
>> + *            %x72 /          ; r    carriage return U+000D
>> + *            %x74 /          ; t    tab             U+0009
>> + *            %x75 4HEXDIG )  ; uXXXX                U+XXXX
>> + *    escape = %x5C              ; \
>> + *    quotation-mark = %x22      ; "
>> + *    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
>> + *
>> + *  Extensions over RFC 7159:
>> + *  - Extra escape sequence in strings:
>> + *    0x27 (apostrophe) is recognized after escape, too
>> + *  - Single-quoted strings:
>> + *    Like double-quoted strings, except they're delimited by %x27
>> + *    (apostrophe) instead of %x22 (quotation mark), and can't contain
>> + *    unescaped apostrophe, but can contain unescaped quotation mark.
>> + *
>> + *  Note:
>> + *  - Encoding is modified UTF-8.
>
> That is an extension over RFC 7159. But I'm okay with leaving it in
> the Notes section.
>
>> + *  - Invalid Unicode characters are rejected.
>> + *  - Control characters are rejected by the lexer.
>
> Worth being explicit that this is 00-1f, fe, and ff?

\xFE and \xFF are invalid, not control.  What about:

     *  - Invalid Unicode characters are rejected.
     *  - Control characters \x00..\x1F are rejected by the lexer.
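
For illustration only, here is a minimal sketch of the kind of parser-side
check the grammar above implies once the lexer lets any character follow
'\'.  The names check_escapes() and hex_value() are hypothetical; this is
not the patch's parse_string().  It only validates the body of a string
token (the text between the quotes) and does not decode it, and it leaves
out the modified UTF-8 and invalid-Unicode checks from the Note section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: value of a hex digit, or -1 if ch is not one. */
static int hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') {
        return ch - '0';
    }
    if (ch >= 'a' && ch <= 'f') {
        return ch - 'a' + 10;
    }
    if (ch >= 'A' && ch <= 'F') {
        return ch - 'A' + 10;
    }
    return -1;
}

/*
 * Hypothetical validator: check every escape sequence in the
 * NUL-terminated string body s.
 * Returns 0 if all escapes are valid, -1 on the first invalid one.
 */
static int check_escapes(const char *s)
{
    assert(s != NULL);

    while (*s) {
        if (*s++ != '\\') {
            continue;           /* ordinary character, nothing to check */
        }
        switch (*s) {
        case '"': case '\\': case '/':
        case 'b': case 'f': case 'n': case 'r': case 't':
        case '\'':              /* extension over RFC 7159 */
            s++;
            break;
        case 'u':
            s++;
            for (int i = 0; i < 4; i++) {
                /* A NUL here is not a hex digit, so "\u" is incomplete */
                if (hex_value(*s) < 0) {
                    return -1;
                }
                s++;
            }
            break;
        default:
            /* Unknown escape, or '\' as the last character */
            return -1;
        }
    }
    return 0;
}
```

Under these assumptions, the body of "abc\@ijk" is rejected as an invalid
escape sequence, and a bare "\u" is rejected as incomplete, which is the
case Eric raised above.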