On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pg...@j-davis.com> wrote:
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

We do know of a few isspace() calls that are already questionable[1] (they should be scanner_isspace(), or something like that). It's not only weird that SELECT ROW('libertà!') is displayed with or without double quotes depending (in theory) on your locale, it's also undefined behaviour because we feed individual bytes of a multi-byte sequence to isspace(), so OSes disagree, and in practice we know that macOS and Windows think that the byte 0xa0 inside 'à' is a space while glibc and FreeBSD don't. Looking at the languages with many sequences containing 0xa0, I guess you'd probably need to be processing CJK text and cross-platform for the difference to become obvious (that was the case for the problem report I analysed):

    for i in range(1, 0xffff):
        if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
            print("%04x: %s" % (i, chr(i)))

[1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com
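
To make the 0xa0 point concrete, here is a rough sketch in Python rather than C. The helper names ascii_isspace() and latin1_style_isspace() are made up for illustration. The second one only models a libc that treats each byte as a Latin-1 code point (where 0xa0 is NO-BREAK SPACE); it is an assumption about why some platforms answer "space", not a description of what any particular libc does internally.

```python
# Hypothetical helper: ASCII-only whitespace test, independent of any locale.
ASCII_SPACE = b" \t\n\r\v\f"


def ascii_isspace(byte: int) -> bool:
    """Return True only for the six ASCII whitespace bytes."""
    return byte in ASCII_SPACE


def latin1_style_isspace(byte: int) -> bool:
    """Model (assumption): classify a single byte as if it were the
    Latin-1 code point with the same value, so 0xa0 counts as a space."""
    return chr(byte).isspace()


# 'à' is encoded as the two bytes 0xc3 0xa0 in UTF-8.
encoded = "libertà!".encode("utf-8")

# Bytes on which the two classifiers disagree.
suspect = [
    hex(b)
    for b in encoded
    if latin1_style_isspace(b) and not ascii_isspace(b)
]

print(encoded)
print(suspect)
```

An ASCII-only test like the first helper can never match a byte of a multi-byte UTF-8 sequence, since those bytes are all >= 0x80, which is the property that would make the quoting decision the same on every platform.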