On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <pe...@eisentraut.org> wrote:
>
> On 30.04.24 19:39, Jacob Champion wrote:
> > Tangentially: Should we maybe rethink pieces of the json_lex_string
> > error handling? For example, do we really want to echo an incomplete
> > multibyte sequence once we know it's bad?
>
> I can't quite find the place you might be looking at in
> json_lex_string(),
(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)

> but for the general encoding conversion we have what
> would appear to be the same behavior in report_invalid_encoding(), and
> we go out of our way there to produce a verbose error message including
> the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes between delimiters around and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding.

(And the parser does not support multibyte input_encodings that
contain ASCII in trail bytes.)

Thanks,
--Jacob
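
To make the Shift-JIS concern concrete, here is a minimal sketch in C. It is not PostgreSQL code: the function names are hypothetical, and the lead-byte ranges (0x81..0x9F and 0xE0..0xFC) are the commonly cited Shift-JIS ones, assumed here for illustration. It shows how a two-byte character whose trail byte is 0x5C (the ASCII backslash position) looks like an escape to a scanner that inspects one byte at a time.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: does this byte start a two-byte Shift-JIS
 * character? Assumes the commonly cited lead-byte ranges.
 */
bool
sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/*
 * Byte-oriented scan: every 0x5C byte counts as a backslash, which is
 * what a parser that only looks at single bytes would conclude.
 */
size_t
count_backslashes_bytewise(const unsigned char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
    {
        if (s[i] == 0x5C)
            n++;
    }
    return n;
}

/*
 * Encoding-aware scan: after a lead byte, skip the trail byte so that a
 * 0x5C inside a two-byte character is not counted.
 */
size_t
count_backslashes_sjis(const unsigned char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
    {
        if (sjis_is_lead(s[i]) && i + 1 < len)
            i++;                /* skip the trail byte */
        else if (s[i] == 0x5C)
            n++;
    }
    return n;
}
```

For the four bytes `'"', 0x83, 0x5C, '"'` (a quoted katakana "so" in Shift-JIS), the byte-oriented scan reports one backslash and the encoding-aware scan reports none. Starting the encoding-aware scan at the third byte, i.e. in the middle of the character, also reports one backslash, which is the sense in which the encoding is not self-synchronizing.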