On Sunday 26 January 2025 19:06:59 Lasse Collin wrote:
> On 2025-01-25 Pali Rohár wrote:
> > On Saturday 25 January 2025 18:52:26 Lasse Collin wrote:
> > > Even if wide char functions were always used to read filenames,
> > > perhaps AreFileApisANSI() needs to be taken into account in
> > > readdir() to determine if the wide char names should be converted
> > > to CP_ACP or CP_OEMCP?
> >
> > Yes. This is what I mean. If you are using CRT's _findfirst() then it
> > will (hopefully) return filenames in narrow encoding, which is also
> > need for POSIX readdir(). If you are going to use FindFirstFileW()
> > then it is required to do conversion to the correct narrow encoding
> > and the encoding is affected by that SetFileApisToOEM() function. At
> > least CP_ACP and CP_OEMCP needs to be considered.
>
> Thanks! I might be starting to see how this is supposed to work. It's
> much more complicated than I had assumed.
>
> > > WideCharToMultiByte() also has CP_THREAD_ACP on W2k and later. The
> > > docs of SetFileApisToOEM() and AreFileApisANSI() say that they are
> > > about the process code page, not thread. Maybe CP_ACP is correct
> > > then. I haven't tested.
> >
> > If the thread (or process) is switched to OEM encoding then you cannot
> > use ANSI (=ACP). But I do not have details when and how this encoding
> > insanity is applied (if is really per-thread or just global process).
> > This CP_THREAD_ACP looks like even more complicated thing, specially
> > how it interact with the setlocale().
>
> I currently suspect that CP_THREAD_ACP isn't relevant here but I don't
> actually know.
>
> > Also it is questionable if the _findfirst() and FindFirstFileA()
> > differs in encoding based on other CRT / WINAPI settings.
> >
> > I have feeling that CRT's setlocale() may change the encoding used by
> > the CRT's _findfirst(), but does not affect WINAPI FindFirstFileA().
>
> These are good questions. Some results from experimenting with UCRT:
>
> (1) If UTF-8 is set in application manifest, both CP_ACP and CP_OEMCP
>     are CP_UTF8, and SetFileApisToOEM() doesn't seem to change
>     anything. Setting a non-UTF-8 locale with setlocale() doesn't affect
>     file system APIs (they stay as UTF-8). ___lc_codepage_func() is
>     affected still if a non-UTF-8 locale is set.
>
> The rest of the results are without UTF-8 manifest. ACP is 1252 and
> OEMCP is 850.
>
> (2) If one calls setlocale(LC_ALL, ".UTF-8"), CRT APIs like
>     _findfirst(), _open(), and fopen() use UTF-8. If one then calls
>     SetFileApisToOEM(), the CRT APIs stay at UTF-8. The locale doesn't
>     affect FindFirstFileA() which uses CP_ACP or CP_OEMCP (not UTF-8).
>
> (3) If one calls setlocale(LC_ALL, ""), sets a non-UTF-8 locale, or
>     doesn't call setlocale() at all, SetFileApisToOEM() affects both
>     FindFirstFileA() and the CRT functions. So now they stay in sync.
>
> When file APIs are ANSI, the code page of both CRT and Win32 file
> system APIs stay at 1252 even if one sets a locale with a different
> code page, for example, setlocale(LC_ALL, "uk.1251").
> ___lc_codepage_func() is affected still.
>
> It's as if UTF-8 is a special case in UCRT. If locale is changed to
> UTF-8 then it affects CRT file system APIs, otherwise CRT and Win32
> APIs both use ACP or OEM.
>
> Perhaps dirent needs to do something like this for WideCharToMultiByte:
>
>     unsigned int cp;
>     DWORD flags;
>
>     if (___lc_codepage_func() == CP_UTF8) {
>         cp = CP_UTF8;
>         flags = WC_ERR_INVALID_CHARS;
>     } else {
>         cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
>         flags = WC_NO_BEST_FIT_CHARS;
>     }
>
> I hope I made a mistake somewhere and it's actually simpler.
>
> > And in my opinion, POSIX readdir() should follow the encoding which is
> > used by the CRT _findfirst().
>
> I agree because that encoding is used by _open(), fopen(), and other CRT
> functions too.
>
> --
> Lasse Collin
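
The selection proposed in the quoted snippet could be pulled out into a
small pure function, so that the decision can be checked against the
observed cases without calling any Win32 API. This is only a sketch of
that idea, not tested against UCRT: struct narrow_conv,
pick_narrow_conv() and check_observed_cases() are hypothetical names,
and the fallback #defines just repeat the Windows SDK values so that
the file also compiles on other systems.

```c
#include <assert.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Numeric values as defined in the Windows SDK headers, repeated here
 * only so that this sketch compiles on non-Windows systems. */
#define CP_ACP               0u
#define CP_OEMCP             1u
#define CP_UTF8              65001u
#define WC_NO_BEST_FIT_CHARS 0x00000400ul
#define WC_ERR_INVALID_CHARS 0x00000080ul
#endif

/* Hypothetical helper type: the code page and flags that would be
 * passed to WideCharToMultiByte(). */
struct narrow_conv {
    unsigned int cp;
    unsigned long flags;
};

/*
 * lc_codepage:    the value returned by ___lc_codepage_func()
 * file_apis_ansi: nonzero if AreFileApisANSI() returned TRUE
 *
 * Mirrors the quoted proposal: a UTF-8 locale wins, otherwise the
 * ANSI/OEM file API mode decides. The locale's own code page (for
 * example 1251) is deliberately ignored in the second branch.
 */
static struct narrow_conv
pick_narrow_conv(unsigned int lc_codepage, int file_apis_ansi)
{
    struct narrow_conv c;

    if (lc_codepage == CP_UTF8) {
        c.cp = CP_UTF8;
        c.flags = WC_ERR_INVALID_CHARS;
    } else {
        c.cp = file_apis_ansi ? CP_ACP : CP_OEMCP;
        c.flags = WC_NO_BEST_FIT_CHARS;
    }
    return c;
}

/* The observations (2) and (3) from the quoted results, as assertions. */
static void
check_observed_cases(void)
{
    /* (2) UTF-8 locale: stays UTF-8 even after SetFileApisToOEM(). */
    assert(pick_narrow_conv(CP_UTF8, 0).cp == CP_UTF8);

    /* (3) setlocale(LC_ALL, "uk.1251") with ANSI file APIs: CP_ACP. */
    assert(pick_narrow_conv(1251, 1).cp == CP_ACP);

    /* (3) non-UTF-8 locale after SetFileApisToOEM(): CP_OEMCP. */
    assert(pick_narrow_conv(1252, 0).cp == CP_OEMCP);
}
```

On Windows the caller would presumably pass ___lc_codepage_func() and
AreFileApisANSI() as the two arguments and hand the resulting cp and
flags to WideCharToMultiByte(). Keeping the flags separate per branch
matters because WC_NO_BEST_FIT_CHARS is not accepted together with
CP_UTF8.
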

That is even more complicated than I thought... Thanks for doing these
checks.

It might be a good idea to look into the source code of the last
released version of UCRT. The same ___lc_codepage_func() / CP_UTF8 /
AreFileApisANSI() / CP_ACP / CP_OEMCP logic should be there too (if it
was guessed correctly). Maybe there are some other corner cases?

Slightly off-topic and not related to readdir(), but it could be
interesting to check: what happens if you call
setlocale(LC_ALL, ".UTF-8") before the __getmainargs() call (which is
in the mingw-w64 startup code, crtexe.c)? Would this force UCRT to pass
argv[] to main() in UTF-8 encoding even without a UTF-8 manifest?

_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public