Re: [fpc-pascal] json parsing: detecting invalid escape sequences

Michael Van Canneyt via fpc-pascal Tue, 29 Sep 2020 14:23:58 -0700


On Tue, 29 Sep 2020, Benito van der Zander via fpc-pascal wrote:

Hi,
I am supposed to find invalid escape sequences when parsing JSON and replace them with a user defined fallback. Invalid in the sense that the unicode codepoint is not defined or a missing surrogate, not syntactically invalid.
For example, any occurrence of \uFFFF and \uDEAD should be replaced by \uffff and \udead respectively. Or alternatively with ???? depending on the settings.
I think I need to change the JSON scanner to be able to do that.
I could add a callback function

  OnInvalidEscape: function (escapeStart: pchar): string; of object;

Or perhaps

  OnInvalidEscape: function (unicodePoint, previousUnicodePointSurrogate: integer): string; of object;
  {although that would be troublesome if \uDEAD and \udead are supposed to be replaced with a different fallback}

Or

  OnInvalidEscape: function (const escapedString: string[4]): string; of object;

The function would return the unescaped value. Alternatively, the current string could be passed to it as var parameter, and the function would append its unescaped value directly.
Or move all unescaping to a callback function, could be called OnUnescape or OnDecodeEscape. So the scanner does not need to decide which escapes are invalid. Then

  if (joUTF8 in Options) or (DefaultSystemCodePage=CP_UTF8) then
    S:=Utf8Encode(WideString(WideChar(u1)+WideChar(u2))) // ToDo: use faster function
  else
    S:=String(WideChar(u1)+WideChar(u2)); // WideChar converts the encoding. Should it warn on loss?

could be replaced by one function call. And if the user does not set a callback function, the scanner would set its own callback function depending on the option.
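
To make the proposal concrete, here is a rough, self-contained sketch of how a callback of this kind might be wired up. It is not taken from the fpjson sources: TInvalidEscapeHandler, TFallbacks and ResolveInvalidEscape are hypothetical names invented for this illustration, and the '?' default is an arbitrary choice. The handler receives the four hex digits of the offending escape and returns the text to put in its place.

```pascal
program InvalidEscapeDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Hypothetical callback type: gets the four hex digits of the escape,
  // returns the replacement text.
  TInvalidEscapeHandler = function(const HexDigits: string): string of object;

  // Two example fallback policies, matching the two settings mentioned above.
  TFallbacks = class
    function KeepEscaped(const HexDigits: string): string;
    function QuestionMarks(const HexDigits: string): string;
  end;

function TFallbacks.KeepEscaped(const HexDigits: string): string;
begin
  // Leave the escape in the output as literal text, lowercased.
  Result := '\u' + LowerCase(HexDigits);
end;

function TFallbacks.QuestionMarks(const HexDigits: string): string;
begin
  Result := '????';
end;

// Stand-in for the place in a scanner where an invalid escape was detected.
function ResolveInvalidEscape(const HexDigits: string;
  Handler: TInvalidEscapeHandler): string;
begin
  if Assigned(Handler) then
    Result := Handler(HexDigits)
  else
    Result := '?'; // arbitrary default for this sketch only
end;

var
  F: TFallbacks;
begin
  F := TFallbacks.Create;
  try
    WriteLn(ResolveInvalidEscape('DEAD', @F.KeepEscaped));
    WriteLn(ResolveInvalidEscape('DEAD', @F.QuestionMarks));
    WriteLn(ResolveInvalidEscape('DEAD', nil));
  finally
    F.Free;
  end;
end.
```

Passing the hex digits as a plain string keeps \uDEAD and \udead distinguishable for the handler, which the integer-based variant would not.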


Such a function existed some iterations back (although not for the same 
purpose).
You will see that this drastically reduces the speed of the scanner because
of the extra exception handling frames.

I think even the checking of 'valid' escape sequences will already reduce
speed significantly.
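
As an illustration of what such a check involves, here is a minimal sketch. It is not the scanner's code, and the function names are made up for this example. It only looks at the 16-bit value of a \uXXXX escape: lone or mismatched surrogates and the BMP noncharacters (U+FDD0..U+FDEF, U+FFFE, U+FFFF) are treated as invalid. It does not check whether a code point is actually assigned, nor the noncharacters outside the BMP; both would need more work per escape.

```pascal
program EscapeCheckDemo;
{$mode objfpc}{$H+}

function IsHighSurrogate(U: Word): Boolean;
begin
  Result := (U >= $D800) and (U <= $DBFF);
end;

function IsLowSurrogate(U: Word): Boolean;
begin
  Result := (U >= $DC00) and (U <= $DFFF);
end;

// Noncharacters inside the BMP: U+FDD0..U+FDEF, U+FFFE and U+FFFF.
function IsBmpNonCharacter(U: Word): Boolean;
begin
  Result := ((U >= $FDD0) and (U <= $FDEF)) or (U >= $FFFE);
end;

// U1 is the escape being examined. U2 is the escape that directly
// follows it, or 0 when there is none (a simplification for this sketch).
function IsValidEscape(U1, U2: Word): Boolean;
begin
  if IsHighSurrogate(U1) then
    // A high surrogate is only valid when a low surrogate follows.
    Result := IsLowSurrogate(U2)
  else
    // Otherwise it must be neither a stray low surrogate nor a noncharacter.
    Result := (not IsLowSurrogate(U1)) and (not IsBmpNonCharacter(U1));
end;

begin
  WriteLn(IsValidEscape($0041, 0));     // 'A': valid
  WriteLn(IsValidEscape($D83D, $DE00)); // proper surrogate pair: valid
  WriteLn(IsValidEscape($D83D, 0));     // high surrogate alone: invalid
  WriteLn(IsValidEscape($DEAD, 0));     // stray low surrogate: invalid
  WriteLn(IsValidEscape($FFFF, 0));     // noncharacter: invalid
end.
```

Even this reduced form adds several comparisons and a branch for every escape the scanner encounters.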

While I am interested in improving the scanner, I am not interested in what
is essentially an error-correcting mechanism for faulty JSON.

I am strengthened in my opinion by this part of the various RFCs:

"However, the ABNF in this specification allows member names and
 string values to contain bit sequences that cannot encode Unicode
 characters;"

So I see little point in trying to correct that.

Michael.

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
