On Sunday 12 January 2025 21:09:38 Lasse Collin wrote: > On 2025-01-10 Pali Rohár wrote: > > Then there is an issue that argv[i] is suppose to be filename argument > > and even after the correct splitting of cmd-to-argv[], due to the > > bestfit, the argv[i] describes other file than the original one in > > wargv[i]. I was thinking more about this issue... but now I'm thinking > > that this is not issue at all. The reason is that very similar > > behavior has also Apple Mac OS X systems, then they are doing > > modified NFD UTF-8 normalization of filenames. > > NFD normalization means that different input strings can refer to the > same file. This can happen with best-fit mapping too. > > With best-fit mapping, the same input string can refer to two different > files depending on if the file is interpret as wide chars or converted > to multibyte. A single input string cannot refer to two different files > due to NFD normalization. > > Thus, I think these things aren't comparable.
Ok, that is truth. So it means that some kind of problem is still there. But this is something which cannot be handled or fixed... > > For fixing 1. the only way is to modify _acmdln and replace all > > problematic characters which are result of bestfit by some > > non-problematic replacement. As application may use _acmdln we have to > > fill it with something which is as much as possible the original > > string, but is not vulnerable anymore. Problematic characters include > > those which are used for cmdline-to-argv[] splitting and also > > characters used for wildcard expansion. > > Doing could be OK most of the time. A contrived counter-example is an > app that takes a human-readable message directly from _acmdln. Then > full best-fit mapping can result in the most readable message. :-) GetCommandLineA(), __p__acmdln(), _acmdln and WinMain()'s lpCmdLine are all affected by this problem. In any case, I think we should mark all these functions and variables as security vulnerable. Is there some attribute for gcc for marking them, so usage would throw security compile warning? lpCmdLine the argument for WinMain() so such thing would require more smart things from compiler as the WinMain() is required (it is entry point), so function usage should not throw warning, but rather usage of its 3rd argument should. > > Maybe we can provide global const variable with extra problematic > > characters which can application define on its own? > > Such features sound neat but I *guess* that very few apps would use such > features. Better keep things simple (and smaller code). Those few that > need customization can do the sanitization themselves; a feature > provided by MinGW-w64 might not be enough for such apps. Ok. > > For fixing 2. I think we can reuse your idea which you proposed. Do > > not use msvcrt/ucrt __getmainargs() function, but rather msvcrt/ucrt > > __wgetmainargs() function and do wchar_t* to char* conversion per > > wargv[i] element. Just this needs to be done in very smart way to > > always do that bestfit mapping because this function must not fail, > > application main() has to be called and bestfit is the most sane > > scenario. > > This should be a low risk change. It would fix the quoting breakage. > Other best-fit issues in argv[] would remain but it's still a > significant improvement. > > However, isn't there an inconsistency if _acmdln is sanitized to avoid > best-fit chars but argv[] can still have them? Yes, there is. But _acmdln usage is smaller than argv[] usage. _acmdln (if is going to be fixed) needs to be sanitized. argv[] does not have. And for application it is better to have as much of data from original cmdline as possible. > > For fixing 3. I think that solution from 2. is enough. > > Starting from wargv[] indeed avoids best-fit in the wildcard expansion > step. > > > Issue 4. is unfixable by the mingw-w64. Only application knows how is > > going to use argv[] or __argv[]. mingw-w64 could help application by > > providing: > > * global bool variable which can disable bestfit mapping and/or call > > application callback function when it happens. For example > > application would want to print some error message and exit. Or > > mingw-w64 can provide generic error message. > > A global bool to detect if best-fit happened in the wargv-to-argv step > could cover a lot use cases already. This way it's possible to provide > a translated error message or do other customizations when displaying > the error. So just providing (from application point of view) readonly bool variable which would be set by mingw-w64 runtime code, if the bestfit happened or not? When introducing such new variable, we would need some macro in header file, so application would know if it can access that new variable or not. To ensure that application can be compiled also with older mingw-w64 version (or with msvc). > > * global bitmap array variable for each argv[i], to contains j-bit set > > if the bestfit was applied for argv[i][j]. > > This sounds neat but I guess this is a too fancy feature in sense that > very few apps would actually use it. > > If one wants it still, the simplest way is to store a boolean at > argv[i][-1]. It only wastes one byte per argument and it survives > argument reordering that getopt_long() can do. argv[i][-1] is strange; it would be better argv[i][strlen(argv[i])+1]. In most cases at &argv[i][-1] is same as &argv[i-1][strlen(argv[i-1])] and argv[] allocation algorithm can be harder to write as it is mostly one linear array. Probably it is really too fancy for application usage... > > Issue 5. is hard. Maybe mingw-w64 can provide wrapper around > > GetCommandLineA() function which will do similar thing as for issue > > 1.? > > I would be cautious with such function overrides but I don't know enough > to form an opinion. > > The UTF-8 code page can create some new issues but it also solves most > of the best-fit trouble. Perhaps at least apps ported from POSIX could > be pushed to UTF-8 usage. Most handle UTF-8 fine on POSIX systems > already. Maybe some day MSYS2 can set it in the default manifest. > > > This dirent.h and readdir functionality is not in msvcrt/ucrt > > libraries. So it is always statically linked into the EXE application > > or DLL library. Also it is not Windows function, so we have to just > > follow POSIX or GNU APIs. > > I was worried about ABI, not API. An ABI change would be a problem if a > DLL has a function like this: > > int foo(DIR *dir); > > If the app and DLL use different statically-linked code and structs for > DIR (and struct dirent), then things go wrong if the app calls > opendir() and passes the resulting pointer to the DLL. One would need > to rebuild both the app and DLL with compatible MinGW-w64 versions. > > I hope such APIs are rare but it's something to keep in mind still. Now I see what you mean. struct dir contains directly struct dirent in the middle. It is not a pointer, but whole structure included. So we have really a problem as the current mingw-w64 opendir() and readdir() functions are not prepared for the change, and are not prepared that they can receive input from future version (e.g. by external DLL library compiled by new future version). And this happens even when application is properly written and does not use any manually allocated memory for struct DIR or struct dirent, and not expect anything about those structs. > > POSIX for struct dirent says: > > https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/basedefs/dirent.h.html > > > > ... define the structure dirent which shall include the following > > members: ino_t d_ino File serial number. > > char d_name[] Filename string of entry. > > ... > > The array d_name in each of these structures is of unspecified size, > > but shall contain a filename of at most {NAME_MAX} bytes followed by a > > terminating null byte. > > > > In limits.h we have: #define NAME_MAX 255 > > I think it should be at least (255 * 3 + 1) which is needed for UTF-8. IIRC 255 * 4 + 1. As one UNICODE code point can be encoded by 4 bytes. > Perhaps (255 * MB_LEN_MAX + 1) is best. Then any file name in any In limits.h we have already #define MB_LEN_MAX 5 So this should be safe. Question is if the 255 is enough. IMHO it needs to be maximal length of the filename (in u16 unit) which can be used by the NT kernel. And I'm not sure what it is right now. > Windows' multibyte character set should fit in d_name[NAME_MAX + 1]. > > NAME_MAX is a POSIX thing so it hopefully it is OK if it only counts > bytes in multibyte strings. For wide chars, d_name[255 + 1] should be > enough already but MAX_PATH (260) doesn't hurt. > > > So I would say that changing sizeof(struct dirent) in mingw-w64 is > > less problematic than changing something related to msvcrt/ucrt ABI > > because it statically linked into DLL or EXE. Also readdir() function > > would be used by the POSIX application, and because different systems > > (Linux, BSD, Solaris, Hurd?) may have different size of these > > structures, POSIX application should be written or prepared of > > unspecified size. So in my opinion changing increasing size of the > > d_name[] member of struct dirent could be possible. > > Right. The only question is if "less problematic" is safe enough. Is > there an application which passes a pointer to "struct dirent" or DIR to > a DLL? If such a thing exist, how to notify that both need to be > rebuilt at the same time. Now I understood what you mean. I guess that there can be DLL library which takes DIR* pointer parameter. And then we have a problem. Maybe we would need type versioning? Like it was with time_t or fpos_t which based on the compile time macro expands either to old (32-bit) or new (64-bit type). > I (hopefully) fixed dirent issues in Gnulib: > > https://lists.gnu.org/archive/html/bug-gnulib/2025-01/msg00087.html > > One cannot copy the code to MinGW-w64 for license reasons but we can > learn from feedback on bug-gnulib still. > > -- > Lasse Collin _______________________________________________ Mingw-w64-public mailing list Mingw-w64-public@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/mingw-w64-public