On Sunday 26 January 2025 19:06:59 Lasse Collin wrote:
> On 2025-01-25 Pali Rohár wrote:
> > On Saturday 25 January 2025 18:52:26 Lasse Collin wrote:
> > > Even if wide char functions were always used to read filenames,
> > > perhaps AreFileApisANSI() needs to be taken into account in
> > > readdir() to determine if the wide char names should be converted
> > > to CP_ACP or CP_OEMCP?  
> > 
> > Yes. This is what I mean. If you are using CRT's _findfirst() then it
> > will (hopefully) return filenames in narrow encoding, which is also
> > needed for POSIX readdir(). If you are going to use FindFirstFileW()
> > then it is required to do conversion to the correct narrow encoding
> > and the encoding is affected by that SetFileApisToOEM() function. At
> > least CP_ACP and CP_OEMCP need to be considered.
> 
> Thanks! I might be starting to see how this is supposed to work. It's
> much more complicated than I had assumed.
> 
> > > WideCharToMultiByte() also has CP_THREAD_ACP on W2k and later. The
> > > docs of SetFileApisToOEM() and AreFileApisANSI() say that they are
> > > about the process code page, not thread. Maybe CP_ACP is correct
> > > then. I haven't tested.  
> > 
> > If the thread (or process) is switched to OEM encoding then you cannot
> > use ANSI (=ACP). But I do not have details on when and how this
> > encoding insanity is applied (whether it is really per-thread or just
> > process-wide). CP_THREAD_ACP looks like an even more complicated
> > thing, especially how it interacts with setlocale().
> 
> I currently suspect that CP_THREAD_ACP isn't relevant here but I don't
> actually know.
> 
> > It is also questionable whether _findfirst() and FindFirstFileA()
> > differ in encoding depending on other CRT / WINAPI settings.
> > 
> > I have a feeling that CRT's setlocale() may change the encoding used
> > by the CRT's _findfirst(), but does not affect WINAPI FindFirstFileA().
> 
> These are good questions. Some results from experimenting with UCRT:
> 
> (1) If UTF-8 is set in application manifest, both CP_ACP and CP_OEMCP
>     are CP_UTF8, and SetFileApisToOEM() doesn't seem to change
>     anything. Setting a non-UTF-8 locale with setlocale() doesn't affect
>     file system APIs (they stay as UTF-8). ___lc_codepage_func() is
>     affected still if a non-UTF-8 locale is set.
> 
> The rest of the results are without UTF-8 manifest. ACP is 1252 and
> OEMCP is 850.
> 
> (2) If one calls setlocale(LC_ALL, ".UTF-8"), CRT APIs like
>     _findfirst(), _open(), and fopen() use UTF-8. If one then calls
>     SetFileApisToOEM(), the CRT APIs stay at UTF-8. The locale doesn't
>     affect FindFirstFileA() which uses CP_ACP or CP_OEMCP (not UTF-8).
> 
> (3) If one calls setlocale(LC_ALL, ""), sets a non-UTF-8 locale, or
>     doesn't call setlocale() at all, SetFileApisToOEM() affects both
>     FindFirstFileA() and the CRT functions. So now they stay in sync.
> 
>     When file APIs are ANSI, the code page of both CRT and Win32 file
>     system APIs stay at 1252 even if one sets a locale with a different
>     code page, for example, setlocale(LC_ALL, "uk.1251").
>     ___lc_codepage_func() is affected still.
> 
> It's as if UTF-8 is a special case in UCRT. If locale is changed to
> UTF-8 then it affects CRT file system APIs, otherwise CRT and Win32
> APIs both use ACP or OEM.
> 
> Perhaps dirent needs to do something like this for WideCharToMultiByte:
> 
>     unsigned int cp;
>     DWORD flags;
> 
>     if (___lc_codepage_func() == CP_UTF8) {
>         cp = CP_UTF8;
>         flags = WC_ERR_INVALID_CHARS;
>     } else {
>         cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
>         flags = WC_NO_BEST_FIT_CHARS;
>     }
> 
> I hope I made a mistake somewhere and it's actually simpler.
> 
> > And in my opinion, POSIX readdir() should follow the encoding which is
> > used by the CRT _findfirst().
> 
> I agree because that encoding is used by _open(), fopen(), and other CRT
> functions too.
> 
> -- 
> Lasse Collin

That is even more complicated than I thought... Thanks for doing these checks.

Maybe it would be a good idea to look into the last released version of
the UCRT source code. This ___lc_codepage_func() / CP_UTF8 /
AreFileApisANSI() / CP_ACP / CP_OEMCP logic should be there too (if it
was guessed correctly). Maybe there are some other corner cases?
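
To make the comparison with the UCRT sources easier, here is a rough
sketch that restates the quoted proposal as a pure function, so that each
observed case can be checked against it. The names narrow_conv and
pick_narrow_conv are hypothetical, and the non-Windows #defines are only
stand-ins with the documented Win32 values so that the table compiles
anywhere. It encodes the guess from the experiments, not confirmed UCRT
behavior.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins with the documented Win32 values (assumption: non-Windows build). */
#define CP_ACP 0
#define CP_OEMCP 1
#define CP_UTF8 65001
#define WC_NO_BEST_FIT_CHARS 0x00000400
#define WC_ERR_INVALID_CHARS 0x00000080
#endif

/* Hypothetical helper type: code page and flags for WideCharToMultiByte(). */
struct narrow_conv {
    unsigned int cp;
    unsigned long flags;
};

/*
 * lc_codepage:    value that ___lc_codepage_func() would return
 * file_apis_ansi: nonzero if AreFileApisANSI() would return TRUE
 */
struct narrow_conv
pick_narrow_conv(unsigned int lc_codepage, int file_apis_ansi)
{
    struct narrow_conv c;

    if (lc_codepage == CP_UTF8) {
        /* Case (2): a UTF-8 locale overrides ANSI/OEM for CRT file APIs. */
        c.cp = CP_UTF8;
        c.flags = WC_ERR_INVALID_CHARS;
    } else {
        /* Case (3): any other locale, follow SetFileApisToOEM() state. */
        c.cp = file_apis_ansi ? CP_ACP : CP_OEMCP;
        c.flags = WC_NO_BEST_FIT_CHARS;
    }

    return c;
}
```

On Windows the two arguments would come from ___lc_codepage_func() and
AreFileApisANSI(), and the result would be passed to
WideCharToMultiByte(). Keeping the decision separate from those calls
makes it simple to add further corner cases if the UCRT sources show any.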


Slightly off-topic and not related to readdir, but it could be
interesting to check what would happen if you call
setlocale(LC_ALL, ".UTF-8") before the __getmainargs() call (which is in
the mingw-w64 startup code, crtexe.c). Would this force UCRT to pass
argv[] in UTF-8 encoding to main() even without a UTF-8 manifest?
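
One way to observe the outcome of such an experiment would be to dump the
bytes of argv[] and check whether they are well-formed UTF-8. The sketch
below only does the inspection. It does not modify the startup code, and
the helper names is_valid_utf8 and dump_args are hypothetical. The
arguments need to contain non-ASCII characters, since pure ASCII looks
the same in every code page involved here.

```c
#include <stdio.h>

/*
 * Return 1 if str is well-formed UTF-8, 0 otherwise. Overlong forms,
 * surrogates, and values above U+10FFFF are rejected.
 */
int
is_valid_utf8(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;

    while (*s != 0) {
        unsigned int cp;   /* decoded code point */
        unsigned int min;  /* smallest value allowed for this length */
        int n;             /* number of continuation bytes expected */

        if (*s < 0x80) {
            s++;
            continue;
        } else if ((*s & 0xE0) == 0xC0) {
            cp = *s & 0x1F; n = 1; min = 0x80;
        } else if ((*s & 0xF0) == 0xE0) {
            cp = *s & 0x0F; n = 2; min = 0x800;
        } else if ((*s & 0xF8) == 0xF0) {
            cp = *s & 0x07; n = 3; min = 0x10000;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }
        s++;

        while (n-- > 0) {
            /* A terminating NUL also fails this check. */
            if ((*s & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }

    return 1;
}

/* Print each argument as hex bytes plus the UTF-8 verdict. */
void
dump_args(int argc, char **argv)
{
    int i;

    for (i = 0; i < argc; i++) {
        const unsigned char *p = (const unsigned char *)argv[i];

        printf("argv[%d]:", i);
        for (; *p != 0; p++)
            printf(" %02x", (unsigned int)*p);
        printf("  (%s)\n",
               is_valid_utf8(argv[i]) ? "valid UTF-8" : "not UTF-8");
    }
}
```

Calling dump_args(argc, argv) as the first statement of main() and
running the program with an argument such as "é" should show c3 a9 if
argv[] arrived as UTF-8, or a single byte if it arrived in a legacy code
page.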


_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
