On 2024-11-11 Pali Rohár wrote:
> If application do not want to fail then double quote in _acmdln must
> not come from other (non double quote) character. Otherwise argv[]
> would be wrongly constructed. And I think that argv[] splitting must
> be done correctly.

I agree it makes sense to do splitting correctly even if parsing is
otherwise permissive. It's unlikely enough that someone has used
fullwidth double quotes in a .bat file to quote filenames and expected
them to work like ASCII double quotes. :-)

> If application wants to fail when conversion is not lossless then it
> does not matter what would be filled in _acmdln at the time of
> application abort / exit call.

Yes.

> So I think that as a first step overwriting _acmdln can be useful.
> Second step could be to add an option to fail on non-lossless
> conversion.

I'm strongly in favor of making the _exit(255) behavior the default and
requiring opt-in to get permissive mode. Most apps will use whatever is
the default. It's easier to fix a few apps that need the permissive mode
than to teach all other apps to enable the strict mode.

If permissive mode can be enabled the same way as _dowildcard works,
then argv[] could be constructed with the same code as in strict mode.
The current crtexe.c in master has

    for (int i = 0; i < argc; ++i) {
        BOOL conv_was_lossy = TRUE;
        int size = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                wargv[i], -1, NULL, 0, NULL, &conv_was_lossy);

but it could become something like this:

    BOOL conv_was_lossy = FALSE;
    BOOL *conv_was_lossy_ptr = _strict_argv ? &conv_was_lossy : NULL;

    for (int i = 0; i < argc; ++i) {
        int size = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                wargv[i], -1, NULL, 0, NULL, conv_was_lossy_ptr);

It would allow apps to set _strict_argv = 0 to disable the _exit(255)
usage.

I don't know if permissive mode should use WC_NO_BEST_FIT_CHARS. Keeping
it can avoid some security risks. But it could create new issues in
cases where best-fit conversion happens to be preferred (for example,
when a command line argument is a message to show to a user).

If WC_NO_BEST_FIT_CHARS has compatibility concerns with very old Windows
versions but one doesn't want best-fit mapping, I wonder if the
lpDefaultChar argument is still usable.
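
To make the strict/permissive idea concrete, here is a rough, portable
sketch. It is not crtexe.c code: convert_arg() is a made-up stand-in for
WideCharToMultiByte() that treats only ASCII as representable, and
strict, def_char, and ARG_BUF_SIZE are hypothetical names. The point is
only that one loop can serve both modes, with a NULL flag pointer meaning
that nobody checks for lossy conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define ARG_BUF_SIZE 64 /* hypothetical fixed buffer size for the sketch */

/* Stand-in for WideCharToMultiByte(): only ASCII is representable here,
   anything else becomes def_char. Like lpUsedDefaultChar, used_default
   may be NULL when the caller does not care about lossy conversion. */
static size_t convert_arg(const wchar_t *src, char *dst, size_t dst_size,
                          char def_char, int *used_default)
{
    size_t n = 0;

    assert(dst_size > 0);
    if (used_default != NULL)
        *used_default = 0;

    for (; src[n] != L'\0' && n + 1 < dst_size; ++n) {
        if ((unsigned long)src[n] < 0x80) {
            dst[n] = (char)src[n];
        } else {
            dst[n] = def_char;
            if (used_default != NULL)
                *used_default = 1;
        }
    }
    dst[n] = '\0';
    return n;
}

/* Same loop for both modes. Returns 255 where the startup code would
   call _exit(255), and 0 otherwise. */
static int convert_argv(int argc, const wchar_t *const wargv[],
                        char out[][ARG_BUF_SIZE], int strict, char def_char)
{
    int lossy = 0;
    int *lossy_ptr = strict ? &lossy : NULL;

    for (int i = 0; i < argc; ++i) {
        convert_arg(wargv[i], out[i], ARG_BUF_SIZE, def_char, lossy_ptr);
        if (lossy)
            return 255;
    }
    return 0;
}
```

With strict set, convert_argv() returns 255 at the first argument that
needed the replacement character. With strict cleared, the same loop
fills every argument and substitutes def_char, which plays the role of
lpDefaultChar.
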
One could set lpDefaultChar to, for example, "?" or "_". The conversion
in crtexe.c happens after possible wildcard expansion in the CRT. Thus
a ? from charset conversion won't be a wildcard before main() is called.
But if the app does wildcard expansion on its own, then ? might be a
problematic replacement character. It would be possible to let the app
customize which character to use but at some point the customization
options get complicated. :-)

If _acmdln were overridden and the CRT then parsed the command line in
narrow mode, best-fit mapping in wildcard expansion could not be avoided
or customized.

I feel the above idea of using the same argv[] code in both strict and
permissive mode is easier. One could still override _acmdln even if the
startup code doesn't need it (in case some app still reads it). If
best-fit mapping isn't needed, the simplest method could be using
WideCharToMultiByte() to convert _wcmdln to _acmdln. One would use
WC_NO_BEST_FIT_CHARS or set lpDefaultChar (perhaps avoiding "?" in case
the app treats it as a wildcard).

You pointed out that it's possible that Microsoft will fix something
around this issue. I understand it might make sense to wait and see what
they do so that we don't create new problems by rushing a fix with
possibly-incompatible behavior into MinGW-w64. :-| I don't have a clear
opinion here, I hope MinGW-w64 maintainers do. :-)

> Does it makes sense to fix this problem in argv[] if we have exactly
> same problem in FindFirstFileA()? argv[] would be just an partial and
> incomplete fix of rather larger issue at all.

Hmm, I guess it still makes sense. Not all apps call FindFirstFileA().

I wonder how FindFirstFileA() could be fixed. Perhaps it could skip
problematic filenames and remember that such a name was seen. Then,
after listing all the other files successfully, it would fail with some
error code. This could create new issues though. :-/

> > FindFirstFileA() and FindFirstFileExA() use best-fit conversion.
> > With UTF-8 code page, only unpaired surrogates are a problem in
> > terms of charset conversion. With UTF-8 one can run into MAX_PATH
> > limitation of WIN32_FIND_DATAA.cFileName though.
>
> I see, so MS chosen to translate all unpaired surrogates to to UNICODE
> replacement character, and therefore made wchar_t[] to utf8_t[]
> mapping non-bijective.

Yes. Documentation of WideCharToMultiByte() says that in WinXP the
conversion was bijective but it was changed in Vista to not produce
invalid UTF-8 (the code points of surrogates are invalid UTF-8).

I suppose that in practice it should often be good enough that sensible
filenames are accessible via *A() APIs with UTF-8 code page. The lossy
conversions are more troublesome as they can result in accessing the
wrong files.

-- 
Lasse Collin

_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public