On 08/08/2018 07:03 AM, Markus Armbruster wrote:
Both the lexer and the parser (attempt to) validate UTF-8 in JSON strings.
The commit before previous made the parser reject invalid UTF-8 sequences. Since then, anything the lexer rejects, the parser would reject as well. Thus, the lexer's rejecting is unnecessary for correctness, and harmful for error reporting.
Nice analysis.
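To make the overlap concrete: a strict UTF-8 check of the kind the parser now performs already catches everything the lexer's byte ranges were meant to catch. Below is a rough, self-contained sketch. The function name utf8_is_valid() is made up for illustration and this is not the parser's actual code, which may differ in detail (for instance in how it treats an encoded NUL).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative strict UTF-8 check (hypothetical helper, not QEMU code).
 * Returns true if buf[0..len) is well-formed UTF-8: no stray or missing
 * continuation bytes, no overlong encodings, no surrogates, and nothing
 * above U+10FFFF.
 */
static bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t n, k;

        if (c < 0x80) {                     /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* lead byte of 2-byte sequence */
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* lead byte of 3-byte sequence */
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* lead byte of 4-byte sequence */
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;                   /* stray continuation, \xF8..\xFF */
        }

        if (len - i <= n) {
            return false;                   /* sequence is truncated */
        }
        for (k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) {
                return false;               /* not a continuation byte */
            }
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* Overlong (covers \xC0, \xC1), too large (covers \xF5..\xF7),
         * or a UTF-16 surrogate. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += n + 1;
    }
    return true;
}
```

With this sketch, the "abc\xC0\xAFijk" string from the commit message fails at the overlong check, since \xC0\xAF decodes to a code point below 0x80.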
However, we want to keep rejecting ASCII control characters in the lexer, because that produces the behavior we want for unclosed strings. We also need to keep rejecting \xFF in the lexer, because we documented that as a way to reset the JSON parser (docs/interop/qmp-spec.txt section 2.6 QGA Synchronization), which means we can't change how we recover from this error now. I wish we hadn't done that.
Alternatively, we could make it a deliberate design decision that 0xff has special meaning: it causes a lexer reset without also emitting an error message. (That doesn't change this patch; it would be a change on top.)
I think we should treat \xFE the same as \xFF.
Reasonable, as it would cover byte-order-marks.
Change the lexer to accept \xC0..\xC1 and \xF5..\xFD. It now rejects only \x00..\x1F and \xFE..\xFF. Error reporting for invalid UTF-8 in strings is much improved, except for \xFE and \xFF. For the example above, the lexer now produces

    JSON_LCURLY   {
    JSON_STRING   "abc\xC0\xAFijk"
    JSON_COLON    :
    JSON_INTEGER  1
    JSON_RCURLY

and the parser reports just

    JSON parse error, invalid UTF-8 sequence in string

Signed-off-by: Markus Armbruster <arm...@redhat.com>
---
 qobject/json-lexer.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
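For anyone following along, the byte ranges in the commit message boil down to a one-line predicate. Here is a minimal sketch of that rule, assuming made-up helper names; the real lexer in qobject/json-lexer.c is table-driven and does not look like this.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not QEMU code: returns true if the lexer, as
 * described in the commit message, would reject byte c inside a JSON
 * string.  Only \x00..\x1F and \xFE..\xFF are rejected.
 */
static bool lexer_rejects_string_byte(unsigned char c)
{
    return c <= 0x1F || c >= 0xFE;
}

/*
 * Hypothetical helper: returns the index of the first byte in
 * buf[0..len) that the lexer would reject, or len if there is none.
 */
static size_t first_lexer_rejected(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (lexer_rejects_string_byte(buf[i])) {
            break;
        }
    }
    return i;
}
```

Under this rule, every byte of "abc\xC0\xAFijk" passes the lexer, so the whole string reaches the parser as a single JSON_STRING token and the invalid UTF-8 is diagnosed there.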
Reviewed-by: Eric Blake <ebl...@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org