Re: Windows UTF8 system locale

Vladlen Popolitov Wed, 25 Dec 2024 07:56:12 -0800

Noah Misch wrote on 2024-12-17 02:16:

On Tue, Dec 17, 2024 at 02:29:59AM +1300, Thomas Munro wrote:

On Sun, Dec 15, 2024 at 3:32 PM Noah Misch <n...@leadboat.com> wrote:
> For PostgreSQL, I expect the most obvious problems will arise for rolname and
> datname containing non-UTF8.  For example, pg_dumpall relies on
> appendShellString() to call pg_dump for arbitrary datname.  pg_dumpall would
> get "database ... does not exist".


Right, those catalogues have undefined encoding (the initial problem
my CLUSTER ENCODING proposal started trying to fix) and could even be
different for every row, and Windows wants all strings used in
non-wide environ, argv, file APIs, etc to be valid in the ACP (because
it converts them to UTF-16).  We would get away with it if UTF-8
weren't so picky, but come to think of it, so is SJIS, so maybe this
is not a new problem with $SUBJECT?

Wild guess: 文字化け (= mojibake) when encoded as UTF-8 and then passed in
a command line to CreateProcess() with ACP=SJIS might show the problem
(I just gave that string to iconv -f SJIS -t UTF-8 and it rejected it,
I'm assuming that means it'd do the same sort of thing in that
context).


I wasn't ready to believe it, but 010_dump_connstr indeed fails with
GetACP()==932.  We've had test coverage of this for 8+ years, so I gather few
or no runs of the TAP suite on GetACP()==932 systems have ever happened.  Wow.


Here's how your particular example traverses the CP932 command line:

CreateProcessA(0xe6 0x96 0x87 0xe5 0xad 0x97 0xe5 0x8c 0x96 0xe3 0x81 0x91)

argv[1] = e6 96 81 45 ad 97 e5 8c 96 e3 81
GetCommandLineA() = 61 20 e6 96 81 45 ad 97 e5 8c 96 e3 81
GetCommandLineW() = 61 20 8b41 30fb ff6d 601c 55a7 7e3a
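
As a cross-check of the dump above, the following Python sketch re-encodes the
GetCommandLineW() code units to CP932 and compares the result with the argv[1]
bytes.  It assumes Python's built-in "cp932" codec maps these particular
characters the same way the Windows ACP conversion does; the variable names
are illustrative only.

```python
# Code units reported by GetCommandLineW() after the leading "a " (61 20).
wide_units = [0x8B41, 0x30FB, 0xFF6D, 0x601C, 0x55A7, 0x7E3A]
# Bytes reported for argv[1] in the same run.
argv1 = bytes.fromhex("e6 96 81 45 ad 97 e5 8c 96 e3 81")

# What the caller passed to CreateProcessA(): the UTF-8 encoding of the string.
original = "文字化け".encode("utf-8")

wide = "".join(chr(u) for u in wide_units)
# Convert the wide command line back to the ANSI code page (assumed: cp932).
narrow = wide.encode("cp932")

print(narrow.hex(" "))
print("matches argv[1]:", narrow == argv1)
print("matches original:", narrow == original)
```

The original string is 12 bytes long and argv[1] is 11, so the round trip
through the code page cannot have preserved it.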

It's a shame the implicit conversion here doesn't fail with EILSEQ.  I
can't imagine how anything good can ever have come from lossy,
non-error-raising implicit conversions anywhere near argv[].  On the


It's a shame.

other hand, on Unix we have other problems stemming from the
undefinedness.  What does "copy ... to '/tmp/café.txt'" do inside a
LATIN1 database?  macOS: EILSEQ, can't open that file, Linux: sure,
now you have a file whose name is displayed as caf�.txt in your UTF-8
terminal or other software (U+FFFD REPLACEMENT CHARACTER).
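
The Linux case can be mimicked at the byte level.  This is a small Python
sketch, not anything PostgreSQL does: it assumes the LATIN1 database hands the
file name bytes to the OS unchanged, and that a UTF-8 consumer substitutes
U+FFFD for bytes it cannot decode.

```python
name = "café.txt"

# Bytes a LATIN1 database would pass to open(2), assuming no conversion.
latin1_bytes = name.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# What a UTF-8 consumer might display: invalid bytes become U+FFFD.
displayed = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)
print(is_valid_utf8(latin1_bytes))
print(displayed)
```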


GNU ls provides nine options for rendering that name to a terminal:
https://www.gnu.org/software/coreutils/manual/html_node/Formatting-the-file-names.html
https://www.gnu.org/software/coreutils/quotes.html

Non-default option "ls --quoting=literal" does display the "replacement
character" way.  It may count as a shame that POSIX pathnames are [0x1,0xFF]
binary strings instead of Unicode character strings, but here we are.

> 2. Just fail if the system option is enabled and we would appendShellString()
>    a non-UTF8 value.

I guess the general version is just: fail if the string is not valid
in the ACP (MB_ERR_INVALID_CHARS).


Roughly that.
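
The "fail if the string is not valid in the ACP" rule could look roughly like
the following.  This is a Python sketch of the idea only: the function name is
hypothetical, and strict decoding with Python's "cp932" codec stands in for
MultiByteToWideChar() with MB_ERR_INVALID_CHARS, which a real implementation
would call on Windows.

```python
def check_valid_in_code_page(value: bytes, codec: str) -> None:
    """Raise ValueError if value is not a valid string in the given codec.

    Hypothetical helper; strict decoding approximates MB_ERR_INVALID_CHARS.
    """
    try:
        value.decode(codec, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"string is not valid in {codec}: byte offset {exc.start}"
        ) from exc


# Pure ASCII passes under any ASCII-compatible code page.
check_valid_in_code_page(b"postgres", "cp932")

# UTF-8 bytes of the example string are rejected when the ACP is CP932.
try:
    check_valid_in_code_page("文字化け".encode("utf-8"), "cp932")
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```

Failing early like this trades a confusing "database ... does not exist" for
an error that names the real cause.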

With the ACP-matching idea for CLUSTER ENCODING, I *think* it should
become unreachable in the two recommended modes: either those strings
would be pure ASCII, or they'd be in database encoding (same encoding
for all databases enforced) and the ACP would match, so it would all
be aligned without any new conversions being required.  It also has an
UNDEFINED mode so a failed encoding validation there would still be
reachable that way.  Still thinking about it all though.

I see.  Interesting.  Considering you need to be root to change the ACP, I'm
disinclined to bet big on requiring the ACP to match anything about encodings
used in PostgreSQL.  We might get away with it, but it sounds bad for the
Poker Tracker use case.


Hi Noah!

Your investigation of this topic in the previous emails is excellent.  This
UTF-8 feature leads to an annoying test failure (010_dump_connstr).

I read the articles linked above and concluded that this option is the user's
choice to push all programs to use UTF-8.  It forces UTF-8 encoding on the
screen and converts the command line to prevent non-UTF-8 characters.  It is
not common in the Unix world to change the user's command line (though we have
lived with LF to CR-LF conversion since DOS times), but it is already the
actual solution in Windows.  The drawback of this solution is that some
programs can stop working (like the pg_dumpall test).

It is really not such a bad situation.  Even before this, we had to exclude
from this test some characters that cannot be passed through a command
line: " >  <  | & .

How can it be improved?

At least, this test was intended to check that pg_dumpall can use all
characters from 1 to 255, with some exceptions.  It did its job and found this
configuration, in which not all characters can be used: the OS considers them
wrong in a command line even if PostgreSQL considers them correct.


Option 1.
Skip this test for Windows in UTF-8 mode.

Option 2.
Exclude all 8-bit characters for Windows in UTF-8 mode.  Currently only " is
excluded for Windows.

Option 3.
Test with a limited list of valid UTF-8 characters, just to check that they
also work.  It could be 64 two-byte UTF-8 characters.
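
For Option 3, one possible set is the 64 code points U+00C0 through U+00FF,
each of which takes two bytes in UTF-8.  The range is only an illustration,
not a proposal from the thread; the Python sketch below builds the list and
checks the byte lengths.

```python
# Hypothetical candidate set for Option 3: U+00C0..U+00FF (64 code points).
candidates = [chr(cp) for cp in range(0x00C0, 0x0100)]

# Encode each candidate separately so the per-character length can be checked.
encoded = [c.encode("utf-8") for c in candidates]

count = len(candidates)
all_two_bytes = all(len(b) == 2 for b in encoded)

print(count, all_two_bytes)
print(encoded[0].hex(" "), encoded[-1].hex(" "))
```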

It would be interesting to hear other opinions.

--
Best regards,

Vladlen Popolitov.
