On 2024-11-11 Pali Rohár wrote:
> If application do not want to fail then double quote in _acmdln must
> not come from other (non double quote) character. Otherwise argv[]
> would be wrongly constructed. And I think that argv[] splitting must
> be done correctly.

I agree it makes sense to do splitting correctly even if parsing is
otherwise permissive. It's unlikely enough that someone has used
fullwidth double quotes in a .bat file to quote filenames and expecting
them to work like ASCII double quotes. :-)

> If application wants to fail when conversion is not lossless then it
> does not matter what would be filled in _acmdln at the time of
> application abort / exit call.

Yes.

> So I think that as a first step overwriting _acmdln can be useful.
> Second step could be to add an option to fail on non-lossless
> conversion.

I'm strongly in favor of making the _exit(255) behavior the default
and requiring opt-in to get permissive mode. Most apps will use
whatever is the default. It's easier to fix a few apps that need the
permissive mode than to teach all other apps to enable the strict mode.

If permissive mode can be enabled the same way as _dowildcard works,
then argv[] could be constructed with the same code as in strict mode.
The current crtexe.c in master has

  for (int i = 0; i < argc; ++i) {
    BOOL conv_was_lossy = TRUE;
    int size = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
        wargv[i], -1,
        NULL, 0,
        NULL, &conv_was_lossy);

but it could become something like this:

  BOOL conv_was_lossy = FALSE;
  BOOL *conv_was_lossy_ptr = _strict_argv ? &conv_was_lossy : NULL;

  for (int i = 0; i < argc; ++i) {
    int size = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
        wargv[i], -1,
        NULL, 0,
        NULL, conv_was_lossy_ptr);

It would allow apps to set _strict_argv = 0 to disable the _exit(255)
usage.

I don't know if permissive mode should use WC_NO_BEST_FIT_CHARS.
Keeping it can avoid some security risks. But maybe it could create new
issues if best-fit conversion happened to be preferred (for example,
command line argument is a message to show to a user).

If WC_NO_BEST_FIT_CHARS has compatibility concerns with very old Windows
versions but one doesn't want best-fit mapping, I wonder if the
lpDefaultChar argument is still usable. One could set it to, for
example, "?" or "_".

The conversion in crtexe.c happens after possible wildcard expasion in
the CRT. Thus a ? from charset conversion won't be a wildcard before
main() is called. But if the app does wildcard expansion on its own,
then ? might be a problematic replacement character. It would be
possible to let app to customize which character to use but at some
point the customization options get complicated. :-)

If _acmdln was overridden and then CRT would parse the command line in
narrow mode, best-fit mapping in wildcard expansion cannot be avoided
or customized. I feel the above idea of using the same argv[] code in
both strict and permissive mode is easier.

One could still override _acmdln even if the startup code doesn't need
it (in case some app reads it still). If best-fit mapping isn't needed,
the simplest method could be using WideCharToMultiByte() to convert
_wcmdln to _acmdln. One would use WC_NO_BEST_FIT_CHARS or set
lpDefaultChar (perhaps avoiding "?" in case app treats it as a
wildcard).

You pointed out that it's possible that Microsoft will fix something
around this issue. I understand it might make sense to wait what they
will do so that we don't create new problems by rushing a fix with
possibly-incompatible behavior into MinGW-w64. :-| I don't have a clear
opinion here, I hope MinGW-w64 maintainers do. :-)

> Does it makes sense to fix this problem in argv[] if we have exactly
> same problem in FindFirstFileA()? argv[] would be just an partial and
> incomplete fix of rather larger issue at all.

Hmm, I guess it still makes sense. Not all apps call FindFirstFileA().

I wonder how FindFirstFileA() could be fixed. Perhaps it could skip
problematic filenames and remember that such a name was seen. Then
after listing all files successfully, fail with some error code. This
could create new issues though. :-/

> > FindFirstFileA() and FindFirstFileExA() use best-fit conversion.
> > With UTF-8 code page, only unpaired surrogates are a problem in
> > terms of charset conversion. With UTF-8 one can run into MAX_PATH
> > limitation of WIN32_FIND_DATAA.cFileName though.
> 
> I see, so MS chosen to translate all unpaired surrogates to to UNICODE
> replacement character, and therefore made wchar_t[] to utf8_t[]
> mapping non-bijective.

Yes. Documentation of WideCharToMultiByte() says that in WinXP the
conversion was bijective but it was changed in Vista to not produce
invalid UTF-8 (the code points of surrogates are invalid UTF-8).

I suppose that in practice it should often be good enough that sensible
filenames are accessible via *A() APIs with UTF-8 code page. The lossy
conversions are more troublesome as they can result in access of wrong
files.

-- 
Lasse Collin


_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public

Reply via email to