Andres Freund <and...@anarazel.de> writes:
> On 2025-02-15 12:35:45 -0800, Jeff Davis wrote:
>> I am not suggesting a change, but there's a minor point about the
>> behavior of the replacement that I'd like to highlight:
>> Unicode discusses a choice[1]: "An ill-formed subsequence consisting of
>> more than one code unit could be treated as a single error or as
>> multiple errors."
> It seems completely infeasible to me to implement the "single error"
> approach in a minor version. It'd afaict require non-trivial new
> infrastructure. We can't just consume up to the next byte without a high bit,
> because some encodings have subsequent bytes that are not guaranteed to have a
> high bit set.

Yeah. Also I think that probably depends on being able to tell the
difference between a first byte and a not-first byte of a multibyte
character, something that works in UTF-8 but not necessarily elsewhere.
As I commented in the security thread, Unicode's recommendations seem
pretty UTF-8-centric; I'm hesitant to adopt them wholesale in code that
has to deal with other encodings.

The v5 patch seems Good Enough(TM) to me. We can refine it later
perhaps; I don't think something like the above would affect anything
that external code should care about.

regards, tom lane
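
For illustration only, here is a minimal C sketch of why the byte-level tests discussed above do not carry over from UTF-8 to other encodings. It is not taken from the v5 patch or from PostgreSQL. The helper names are hypothetical, and utf8_skip_ill_formed is a simplified stand-in for the "single error" idea, not Unicode's recommended practice. The Shift_JIS example relies on the fact that the katakana character "ソ" is encoded as 0x83 0x5C, so its trailing byte has no high bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b looks like a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static bool
utf8_is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical "single error" skip for UTF-8 only: treat s[0] plus any
 * continuation bytes that follow it as one ill-formed unit, and return
 * the number of bytes in that unit.  This works only because UTF-8 lets
 * us tell a first byte from a not-first byte by inspection.
 */
static size_t
utf8_skip_ill_formed(const unsigned char *s, size_t len)
{
	size_t		n;

	if (len == 0)
		return 0;
	n = 1;
	while (n < len && utf8_is_continuation(s[n]))
		n++;
	return n;
}

/*
 * Naive, encoding-agnostic attempt: consume bytes until one without the
 * high bit is found.  In encodings such as Shift_JIS, a trailing byte may
 * be in the ASCII range, so this can stop in the middle of a character.
 */
static size_t
skip_high_bit_run(const unsigned char *s, size_t len)
{
	size_t		n = 0;

	while (n < len && (s[n] & 0x80))
		n++;
	return n;
}
```

On the Shift_JIS bytes 0x83 0x5C 'a', skip_high_bit_run stops after one byte, leaving the trailing 0x5C (an ASCII backslash) behind as if it were a separate character. The UTF-8 continuation test is no help there either: it reports the Shift_JIS lead byte 0x83 as a continuation byte.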