On Fri, 2025-02-14 at 17:27 -0800, Noah Misch wrote:
> I'm attaching a WIP patch from Andres Freund.

I am not suggesting a change, but there's a minor point about the behavior of the replacement that I'd like to highlight:

Unicode discusses a choice[1]:

"An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors."

The patch implements the latter. Escaping:

  <7A F0 80 80 41 7A>

results in:

  <7A C0 20 C0 20 C0 20 41 7A>

The Unicode standard suggests[2] that the former approach may provide more consistency in how it's done, but that doesn't seem important or relevant for our purposes. I'd favor whichever approach results in simpler code.

Regards,
	Jeff Davis

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G48534
[2] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453
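
For illustration only, here is a minimal Python sketch of the two choices described above. It is not the patch's code (the patch is C and may validate differently). The function names are hypothetical, and the "single error" rule used here (an offending byte plus up to three following continuation bytes count as one error) is an assumption chosen to match the <F0 80 80> example, not something the patch or the standard mandates. The two-byte marker <C0 20> is taken from the example output above.

```python
# Illustrative sketch only; all names here are hypothetical.
REPLACEMENT = bytes([0xC0, 0x20])  # marker bytes from the example in the mail


def expected_len(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def well_formed_len(buf: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at buf[i], or 0."""
    n = expected_len(buf[i])
    chunk = buf[i:i + n]
    if n == 0 or len(chunk) < n:
        return 0
    try:
        chunk.decode("utf-8")  # strict: rejects overlongs and surrogates
    except UnicodeDecodeError:
        return 0
    return n


def escape_per_byte(buf: bytes) -> bytes:
    """'Multiple errors': every byte of an ill-formed run gets its own marker."""
    out = bytearray()
    i = 0
    while i < len(buf):
        n = well_formed_len(buf, i)
        if n:
            out += buf[i:i + n]
            i += n
        else:
            out += REPLACEMENT
            i += 1
    return bytes(out)


def escape_single_error(buf: bytes) -> bytes:
    """'Single error' (assumed rule): offending byte plus up to three
    trailing continuation bytes (0x80..0xBF) collapse into one marker."""
    out = bytearray()
    i = 0
    while i < len(buf):
        n = well_formed_len(buf, i)
        if n:
            out += buf[i:i + n]
            i += n
            continue
        j = i + 1
        while j < len(buf) and j < i + 4 and 0x80 <= buf[j] <= 0xBF:
            j += 1
        out += REPLACEMENT
        i = j
    return bytes(out)


if __name__ == "__main__":
    data = bytes.fromhex("7A F0 80 80 41 7A")
    print(escape_per_byte(data).hex(" ").upper())      # 7A C0 20 C0 20 C0 20 41 7A
    print(escape_single_error(data).hex(" ").upper())  # 7A C0 20 41 7A
```

In this sketch the per-byte variant needs no lookahead beyond the validity check itself, while the single-error variant needs an extra rule for where the ill-formed run ends, which is one way to see the "simpler code" consideration.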