On 2025-01-25 Pali Rohár wrote:
> On Saturday 25 January 2025 18:52:26 Lasse Collin wrote:
> > Even if wide char functions were always used to read filenames,
> > perhaps AreFileApisANSI() needs to be taken into account in
> > readdir() to determine if the wide char names should be converted
> > to CP_ACP or CP_OEMCP?  
> 
> Yes. This is what I mean. If you are using CRT's _findfirst() then it
> will (hopefully) return filenames in narrow encoding, which is also
> needed for POSIX readdir(). If you are going to use FindFirstFileW()
> then it is required to do conversion to the correct narrow encoding
> and the encoding is affected by that SetFileApisToOEM() function. At
> least CP_ACP and CP_OEMCP need to be considered.

Thanks! I might be starting to see how this is supposed to work. It's
much more complicated than I had assumed.

> > WideCharToMultiByte() also has CP_THREAD_ACP on W2k and later. The
> > docs of SetFileApisToOEM() and AreFileApisANSI() say that they are
> > about the process code page, not thread. Maybe CP_ACP is correct
> > then. I haven't tested.  
> 
> If the thread (or process) is switched to OEM encoding then you cannot
> use ANSI (=ACP). But I do not have details on when and how this
> encoding insanity is applied (whether it is really per-thread or just
> process-global). This CP_THREAD_ACP looks like an even more
> complicated thing, especially how it interacts with setlocale().

I currently suspect that CP_THREAD_ACP isn't relevant here but I don't
actually know.

> Also it is questionable whether _findfirst() and FindFirstFileA()
> differ in encoding depending on other CRT / WINAPI settings.
> 
> I have a feeling that CRT's setlocale() may change the encoding used
> by the CRT's _findfirst(), but does not affect WINAPI FindFirstFileA().

These are good questions. Some results from experimenting with UCRT:

(1) If UTF-8 is set in application manifest, both CP_ACP and CP_OEMCP
    are CP_UTF8, and SetFileApisToOEM() doesn't seem to change
    anything. Setting a non-UTF-8 locale with setlocale() doesn't affect
    file system APIs (they stay as UTF-8). The value returned by
    ___lc_codepage_func() still changes if a non-UTF-8 locale is set.

The rest of the results are without UTF-8 manifest. ACP is 1252 and
OEMCP is 850.

(2) If one calls setlocale(LC_ALL, ".UTF-8"), CRT APIs like
    _findfirst(), _open(), and fopen() use UTF-8. If one then calls
    SetFileApisToOEM(), the CRT APIs stay at UTF-8. The locale doesn't
    affect FindFirstFileA() which uses CP_ACP or CP_OEMCP (not UTF-8).

(3) If one calls setlocale(LC_ALL, ""), sets a non-UTF-8 locale, or
    doesn't call setlocale() at all, SetFileApisToOEM() affects both
    FindFirstFileA() and the CRT functions. So now they stay in sync.

    When file APIs are ANSI, the code page of both CRT and Win32 file
    system APIs stays at 1252 even if one sets a locale with a different
    code page, for example, setlocale(LC_ALL, "uk.1251"). The value
    returned by ___lc_codepage_func() still changes.

It's as if UTF-8 is a special case in UCRT. If the locale is changed to
UTF-8, it affects the CRT file system APIs; otherwise CRT and Win32
APIs both use ACP or OEMCP, depending on SetFileApisToOEM().
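
The three results can be condensed into a small model. This is only a
sketch of the observed behavior, not of how UCRT implements it. The
struct and function names are made up for illustration, and the inputs
are plain parameters standing in for what GetACP(), GetOEMCP(),
___lc_codepage_func(), and AreFileApisANSI() would return, so the
logic can be checked without Windows.

```c
#include <stdbool.h>

/* Hypothetical names; this only models the results reported above. */
#define MODEL_CP_UTF8 65001u  /* same numeric value as Win32 CP_UTF8 */

struct fs_code_pages {
    unsigned int crt;    /* used by _findfirst(), _open(), fopen() */
    unsigned int win32;  /* used by FindFirstFileA() */
};

/*
 * acp, oemcp: stand-ins for GetACP() and GetOEMCP(); per result (1)
 *             both are 65001 when the manifest selects UTF-8.
 * locale_cp:  stand-in for ___lc_codepage_func().
 * apis_ansi:  stand-in for AreFileApisANSI().
 */
static struct fs_code_pages
model_fs_code_pages(unsigned int acp, unsigned int oemcp,
                    unsigned int locale_cp, bool apis_ansi)
{
    struct fs_code_pages r;

    /* Win32 "A" functions follow SetFileApisToOEM() only. */
    r.win32 = apis_ansi ? acp : oemcp;

    /* CRT functions follow Win32 unless the locale is UTF-8. */
    r.crt = (locale_cp == MODEL_CP_UTF8) ? MODEL_CP_UTF8 : r.win32;

    return r;
}
```

With ACP 1252 and OEMCP 850, a UTF-8 locale plus SetFileApisToOEM()
gives crt = 65001 and win32 = 850, which is case (2). Any non-UTF-8
locale gives the same value for both, which is case (3).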

Perhaps dirent needs to do something like this to pick the code page
and flags for WideCharToMultiByte():

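    unsigned int cp;
    DWORD flags;

    if (___lc_codepage_func() == CP_UTF8) {
        /* UTF-8 locale: CRT file APIs use UTF-8, not ACP/OEMCP. */
        cp = CP_UTF8;
        /* Fail on invalid UTF-16 such as unpaired surrogates. */
        flags = WC_ERR_INVALID_CHARS;
    } else {
        /* Otherwise CRT and Win32 both follow SetFileApisToOEM(). */
        cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
        /* Don't map unrepresentable characters to look-alikes. */
        flags = WC_NO_BEST_FIT_CHARS;
    }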

I hope I made a mistake somewhere and it's actually simpler.
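
A rough sketch of how that selection could feed the actual conversion
follows. The names narrow_conv, pick_narrow_conv(), and
wide_name_to_narrow() are hypothetical and not taken from the
mingw-w64 dirent code. The selection takes its inputs as parameters so
it can be exercised off Windows too; the constants in the #else branch
only repeat the Win32 values for that purpose. Treating a substituted
default character as a failure is an assumption about the desired
policy, not something the experiments above establish.

```c
#include <stdbool.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins with the Win32 values so the selection builds elsewhere. */
#define CP_ACP 0u
#define CP_OEMCP 1u
#define CP_UTF8 65001u
#define WC_ERR_INVALID_CHARS 0x00000080ul
#define WC_NO_BEST_FIT_CHARS 0x00000400ul
#endif

struct narrow_conv {
    unsigned int cp;      /* code page for WideCharToMultiByte() */
    unsigned long flags;  /* dwFlags for WideCharToMultiByte() */
};

/* locale_cp: result of ___lc_codepage_func();
 * apis_ansi: AreFileApisANSI() != 0. */
static struct narrow_conv
pick_narrow_conv(unsigned int locale_cp, bool apis_ansi)
{
    struct narrow_conv c;

    if (locale_cp == CP_UTF8) {
        c.cp = CP_UTF8;
        c.flags = WC_ERR_INVALID_CHARS;
    } else {
        c.cp = apis_ansi ? CP_ACP : CP_OEMCP;
        c.flags = WC_NO_BEST_FIT_CHARS;
    }
    return c;
}

#ifdef _WIN32
/*
 * Convert a NUL-terminated wide name into dst. Returns the number of
 * bytes written including the terminator, or 0 on failure.
 */
static int
wide_name_to_narrow(struct narrow_conv c, const wchar_t *src,
                    char *dst, int dst_size)
{
    BOOL used_default = FALSE;
    /* lpUsedDefaultChar must be NULL when the code page is CP_UTF8. */
    BOOL *used = (c.cp == CP_UTF8) ? NULL : &used_default;
    int n = WideCharToMultiByte(c.cp, c.flags, src, -1,
                                dst, dst_size, NULL, used);

    /* Assumed policy: a name that needed the default character is
     * reported as an error instead of being returned altered. */
    if (n == 0 || used_default)
        return 0;
    return n;
}
#endif
```

On Windows the caller would pass ___lc_codepage_func() and
AreFileApisANSI() != 0 to pick_narrow_conv() for every conversion,
since either value can change while the program runs.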

> And in my opinion, POSIX readdir() should follow the encoding which is
> used by the CRT _findfirst().

I agree because that encoding is used by _open(), fopen(), and other CRT
functions too.

-- 
Lasse Collin


_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
