Eric Blake <ebl...@redhat.com> writes:

> On 08/17/2018 10:05 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>
>>     JSON_ERROR    01
>>     JSON_INTEGER  23
>>
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>>
>> Make the finite state machine eat digits before going into the error
>> state.  The lexer now produces
>>
>>     JSON_ERROR    0123
>>
>> Signed-off-by: Markus Armbruster <arm...@redhat.com>
>> Reviewed-by: Eric Blake <ebl...@redhat.com>
>
> Did you also want to reject invalid attempts at hex numbers, by adding
> [xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?
I put one foot on a slippery slope with this patch...

In review of v1, we discussed whether to try matching non-integer numbers
with a redundant leading zero.  Doing that tightly in the lexer requires
duplicating six states.  A simpler alternative is to have the lexer eat
"digit salad" after a redundant leading zero: 0[0-9.eE+-]+.

Your suggestion for hexadecimal numbers is digit salad with a different
set of digits: [0-9a-fA-FxX].  Another option is their union:
[0-9a-fA-FxX+-].

Even more radical would be eating anything but whitespace and structural
characters: [^][}{:, \t\n\r].  Pushed to the limit, that idea results in
a two-stage lexer: the first stage finds token strings, where a token
string is either a structural character or a sequence of non-structural,
non-whitespace characters; the second stage rejects invalid token
strings.

Hmm, we could try to recover from lexical errors more smartly in general:
instead of ending the JSON error token right after the first offending
character, end it just before the first whitespace or structural
character following it.  I can try that, but I'd prefer to do it in a
follow-up patch.
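Just to make the alternatives above concrete: in the same table style as
the hunk quoted below, the union variant would look roughly like this
(illustration only, not part of this patch):

    /* illustration: IN_BAD_ZERO eating [0-9a-fA-FxX+-] instead of just digits */
    [IN_BAD_ZERO] = {
        ['0' ... '9'] = IN_BAD_ZERO,
        ['a' ... 'f'] = IN_BAD_ZERO,
        ['A' ... 'F'] = IN_BAD_ZERO,
        ['x'] = IN_BAD_ZERO,
        ['X'] = IN_BAD_ZERO,
        ['+'] = IN_BAD_ZERO,
        ['-'] = IN_BAD_ZERO,
    },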