On 08/08/2018 07:03 AM, Markus Armbruster wrote:
Both lexer and parser reject invalid interpolation specifications.
The parser's check is useless.
The lexer ends the token right after the first bad character. This
tends to lead to suboptimal error reporting. For instance, input
[ %11d ]
With no context of other characters on this line, it took me a while to
notice that this was '1' (one) not 'l' (ell), even though my current
font renders the two quite distinctly. (And now you know why base32
avoids 0/1 in its alphabet, to minimize confusion with O/l - see RFC
4648). A better example might be %22d.
produces the tokens
JSON_LSQUARE [
JSON_ERROR %1
JSON_INTEGER 1
And that's in spite of the context you have here, which makes it obvious
that the parser saw an integer.
JSON_KEYWORD d
JSON_RSQUARE ]
The parser then yields an error, an object and two more errors:
error: Invalid JSON syntax
object: 1
error: JSON parse error, invalid keyword
error: JSON parse error, expecting value
Change the lexer to accept [A-Za-z0-9]*[duipsf]. It now produces
That regex doesn't match the code.
JSON_LSQUARE [
JSON_INTERPOLATION %11d
JSON_RSQUARE ]
and the parser reports just
JSON parse error, invalid interpolation '%11d'
Signed-off-by: Markus Armbruster <arm...@redhat.com>
---
qobject/json-lexer.c | 52 +++++++++----------------------------------
qobject/json-parser.c | 1 +
2 files changed, 11 insertions(+), 42 deletions(-)
diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index 0ea1eae4aa..7a82aab88b 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -93,7 +93,8 @@
* (apostrophe) instead of %x22 (quotation mark), and can't contain
* unescaped apostrophe, but can contain unescaped quotation mark.
* - Interpolation:
- * interpolation = %((l|ll|I64)[du]|[ipsf])
+ * The lexer accepts [A-Za-z0-9]*, and leaves rejecting invalid ones
+ * to the parser.
This comment is more apropos. But is it worth spelling "The lexer
accepts %[A-Za-z0-9]*", and/or documenting that recognizing
interpolation during lexing is now optional (thanks to the previous patch)?
@@ -278,6 +238,14 @@ static const uint8_t json_lexer[][256] = {
['\n'] = IN_WHITESPACE,
},
+ /* interpolation */
+ [IN_INTERPOL] = {
+ TERMINAL(JSON_INTERPOL),
+ ['A' ... 'Z'] = IN_INTERPOL,
+ ['a' ... 'z'] = IN_INTERPOL,
+ ['0' ... '9'] = IN_INTERPOL,
+ },
+
Note that we still treat code like "'foo': %#x" as an invalid
interpolation request (it would be a valid printf format), while at the
same time passing "%1" on to the parser (which is not a valid printf
format, so -Wformat would have flagged it). In fact, "%d1" which is
valid for printf as "%d" followed by a literal "1" gets thrown to the
parser as one chunk - but it is not valid in JSON to have %d adjacent to
another integer any more than it is to reject "%d1" as an unknown
sequence. I don't think that catering to the remaining printf
metacharacters (' ', '#', '\'', '-', '*', '+') nor demanding that things
end on a letter is worth the effort, since it just makes the lexer more
complicated when your goal was to do as little as possible to get things
thrown over to the parser.
So even though your lexing is now somewhat different from printf, I
don't see that as a serious drawback (the uses we care about are still
-Wformat clean, and the uses we don't like are properly handled in the
parser).
Perhaps you want to enhance the testsuite (earlier in the series) to
cover "%22d", "%d1", "%1" as various unaccepted interpolation requests
(if you didn't already).
Reviewed-by: Eric Blake <ebl...@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org