On 2025-01-10 Pali Rohár wrote:
> Then there is an issue that argv[i] is suppose to be filename argument
> and even after the correct splitting of cmd-to-argv[], due to the
> bestfit, the argv[i] describes other file than the original one in
> wargv[i]. I was thinking more about this issue... but now I'm thinking
> that this is not issue at all. The reason is that very similar
> behavior has also Apple Mac OS X systems, then they are doing
> modified NFD UTF-8 normalization of filenames.

NFD normalization means that different input strings can refer to the
same file. This can happen with best-fit mapping too.

With best-fit mapping, the same input string can refer to two different
files depending on if the file is interpret as wide chars or converted
to multibyte. A single input string cannot refer to two different files
due to NFD normalization.

Thus, I think these things aren't comparable.

> For fixing 1. the only way is to modify _acmdln and replace all
> problematic characters which are result of bestfit by some
> non-problematic replacement. As application may use _acmdln we have to
> fill it with something which is as much as possible the original
> string, but is not vulnerable anymore. Problematic characters include
> those which are used for cmdline-to-argv[] splitting and also
> characters used for wildcard expansion.

Doing could be OK most of the time. A contrived counter-example is an
app that takes a human-readable message directly from _acmdln. Then
full best-fit mapping can result in the most readable message. :-)

> Maybe we can provide global const variable with extra problematic
> characters which can application define on its own?

Such features sound neat but I *guess* that very few apps would use such
features. Better keep things simple (and smaller code). Those few that
need customization can do the sanitization themselves; a feature
provided by MinGW-w64 might not be enough for such apps.

> For fixing 2. I think we can reuse your idea which you proposed. Do
> not use msvcrt/ucrt __getmainargs() function, but rather msvcrt/ucrt
> __wgetmainargs() function and do wchar_t* to char* conversion per
> wargv[i] element. Just this needs to be done in very smart way to
> always do that bestfit mapping because this function must not fail,
> application main() has to be called and bestfit is the most sane
> scenario.

This should be a low risk change. It would fix the quoting breakage.
Other best-fit issues in argv[] would remain but it's still a
significant improvement.

However, isn't there an inconsistency if _acmdln is sanitized to avoid
best-fit chars but argv[] can still have them?

> For fixing 3. I think that solution from 2. is enough.

Starting from wargv[] indeed avoids best-fit in the wildcard expansion
step.

> Issue 4. is unfixable by the mingw-w64. Only application knows how is
> going to use argv[] or __argv[]. mingw-w64 could help application by
> providing:
> * global bool variable which can disable bestfit mapping and/or call
>   application callback function when it happens. For example
>   application would want to print some error message and exit. Or
>   mingw-w64 can provide generic error message.

A global bool to detect if best-fit happened in the wargv-to-argv step
could cover a lot use cases already. This way it's possible to provide
a translated error message or do other customizations when displaying
the error.

> * global bitmap array variable for each argv[i], to contains j-bit set
>   if the bestfit was applied for argv[i][j].

This sounds neat but I guess this is a too fancy feature in sense that
very few apps would actually use it.

If one wants it still, the simplest way is to store a boolean at
argv[i][-1]. It only wastes one byte per argument and it survives
argument reordering that getopt_long() can do.

> Issue 5. is hard. Maybe mingw-w64 can provide wrapper around
> GetCommandLineA() function which will do similar thing as for issue
> 1.?

I would be cautious with such function overrides but I don't know enough
to form an opinion.

The UTF-8 code page can create some new issues but it also solves most
of the best-fit trouble. Perhaps at least apps ported from POSIX could
be pushed to UTF-8 usage. Most handle UTF-8 fine on POSIX systems
already. Maybe some day MSYS2 can set it in the default manifest.

> This dirent.h and readdir functionality is not in msvcrt/ucrt
> libraries. So it is always statically linked into the EXE application
> or DLL library. Also it is not Windows function, so we have to just
> follow POSIX or GNU APIs.

I was worried about ABI, not API. An ABI change would be a problem if a
DLL has a function like this:

    int foo(DIR *dir);

If the app and DLL use different statically-linked code and structs for
DIR (and struct dirent), then things go wrong if the app calls
opendir() and passes the resulting pointer to the DLL. One would need
to rebuild both the app and DLL with compatible MinGW-w64 versions.

I hope such APIs are rare but it's something to keep in mind still.

> POSIX for struct dirent says:
> https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/basedefs/dirent.h.html
> 
> ... define the structure dirent which shall include the following
> members: ino_t  d_ino       File serial number.
> char   d_name[]    Filename string of entry.
> ...
> The array d_name in each of these structures is of unspecified size,
> but shall contain a filename of at most {NAME_MAX} bytes followed by a
> terminating null byte.
> 
> In limits.h we have: #define NAME_MAX 255

I think it should be at least (255 * 3 + 1) which is needed for UTF-8.
Perhaps (255 * MB_LEN_MAX + 1) is best. Then any file name in any
Windows' multibyte character set should fit in d_name[NAME_MAX + 1].

NAME_MAX is a POSIX thing so it hopefully it is OK if it only counts
bytes in multibyte strings. For wide chars, d_name[255 + 1] should be
enough already but MAX_PATH (260) doesn't hurt.

> So I would say that changing sizeof(struct dirent) in mingw-w64 is
> less problematic than changing something related to msvcrt/ucrt ABI
> because it statically linked into DLL or EXE. Also readdir() function
> would be used by the POSIX application, and because different systems
> (Linux, BSD, Solaris, Hurd?) may have different size of these
> structures, POSIX application should be written or prepared of
> unspecified size. So in my opinion changing increasing size of the
> d_name[] member of struct dirent could be possible.

Right. The only question is if "less problematic" is safe enough. Is
there an application which passes a pointer to "struct dirent" or DIR to
a DLL? If such a thing exist, how to notify that both need to be
rebuilt at the same time.

I (hopefully) fixed dirent issues in Gnulib:

    https://lists.gnu.org/archive/html/bug-gnulib/2025-01/msg00087.html

One cannot copy the code to MinGW-w64 for license reasons but we can
learn from feedback on bug-gnulib still.

-- 
Lasse Collin


_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public

Reply via email to