On Tuesday 14 January 2025 21:42:18 Lasse Collin wrote:
> On 2025-01-12 Pali Rohár wrote:
> > GetCommandLineA(), __p__acmdln(), _acmdln and WinMain()'s lpCmdLine
> > are all affected by this problem. In any case, I think we should mark
> > all these functions and variables as security vulnerable. Is there
> > some attribute for gcc for marking them, so usage would throw security
> > compile warning? lpCmdLine the argument for WinMain() so such thing
> > would require more smart things from compiler as the WinMain() is
> > required (it is entry point), so function usage should not throw
> > warning, but rather usage of its 3rd argument should.
> 
> The "deprecated" attribute can be used to mark functions and variables.

I know about this and I think that marking GetCommandLineA(),
__p__acmdln() and _acmdln could be done.

> I think it doesn't help with a specific function argument.

I think that deprecated would not work here. I'm not sure if there is
something other.

> I'm not able
> to comment if such marking is good here vs. how noisy or annoying the
> warnings would be.

Well, warnings are warnings. They are always being added by new compiler
versions, so I would not be afraid of adding also in new mingw-w64
version. And security "warning" for me sounds like a good idea.

> > > A global bool to detect if best-fit happened in the wargv-to-argv
> > > step could cover a lot use cases already. This way it's possible to
> > > provide a translated error message or do other customizations when
> > > displaying the error.  
> > 
> > So just providing (from application point of view) readonly bool
> > variable which would be set by mingw-w64 runtime code, if the bestfit
> > happened or not?
> 
> Yes. I would be set to true if at least one of argv[] strings wasn't
> converted losslessly. Apps can then decide what to do with the
> information. It's not possible to provide a full solution inside
> MinGW-w64 alone because different apps have different needs. This
> method keeps the MinGW-w64 portion fairly minimal.

Ok, this is fine.

> One could combine it with the the argv[i][-1] flag. It's not much extra
> code. I don't know if it is worth even that though.
> 
> > When introducing such new variable, we would need some macro in header
> > file, so application would know if it can access that new variable or
> > not. To ensure that application can be compiled also with older
> > mingw-w64 version (or with msvc).
> 
> Yes.
> 
> > argv[i][-1] is strange; it would be better argv[i][strlen(argv[i])+1].
> 
> It might look strange but in some situations it's a convenient method.
> argv[i][strlen(argv[i])+1] is a more complex expression to type and
> also a little less efficient.
> 
> > In most cases at &argv[i][-1] is same as &argv[i-1][strlen(argv[i-1])]
> > and argv[] allocation algorithm can be harder to write as it is mostly
> > one linear array.
> 
> It can be but it doesn't need to be. For example, duplicate_ppstrings()
> in crtexe.c doesn't create a flat array.
> 
> If something about the command line handling is fixed in MinGW-w64, I
> think it's good to start with the warg-to-argv method. Fixing the
> quoting issue is a significant improvement and highly unlikely to cause
> any breakage (unless someone has used fullwidth double quotes as ASCII
> double quote replacements just for fun). So:
> 
>   - Do it with best-fit to minimize compatiblity risks.
>   - Add a global boolean.
>   - Perhaps add argv[i][-1].
> 
> I have attached a draft patch (header bits are missing) and a demo
> program. It has the above features so it's possible to think if the
> extra code is worth it.

I have already WIP something better, to handle all other parts, like not
leaking wide global variables, properly initialize of narrow variables
and fixing also direct usage of __getmainargs function by applications.
All these parts are not handled in tour draft. I need more time to
finish it and do more tests.

I already started to study the startup phase and what it is in crtexe.c
and crtdll.c code. I think that first would be needed to do some cleanup
as without it would be really hard to review any change if is correct or
not. For example how it would affect the case when the entry point is
called multiple times by other thread...

> _acmdln and WinMain() aren't so important at least when considering
> apps ported from POSIX. I don't mean that nothing should be done about
> them, I only mean that the warg-to-argv sounds like good starting
> point.
> 
> > > > In limits.h we have: #define NAME_MAX 255  
> > > 
> > > I think it should be at least (255 * 3 + 1) which is needed for
> > > UTF-8.  
> > 
> > IIRC 255 * 4 + 1. As one UNICODE code point can be encoded by 4 bytes.
> 
> wchar_t is 16 bits on Windows. The highest code point that can be
> encoded using one wchar_t is U+FFFF. A three-byte UTF-8 character holds
> 16 bits, so the last three-byte character is U+FFFF. So one wchar_t
> produces three bytes.
> 
> A four-byte UTF-8 character consumes two wide chars on Windows due
> to surrogate pairs. Using two wide chars to produce four bytes results
> in a smaller byte count than above.

Ok, I see. I must admit that this is really tricky to figure out.

> With GB18030, U+FFFD consumes four bytes. So with that charset, the
> maximum possible byte count is larger.

It is possible to change local process ucrt encoding to GB18030?

> > > Perhaps (255 * MB_LEN_MAX + 1) is best. Then any file name in any  
> > 
> > In limits.h we have already #define MB_LEN_MAX 5
> > So this should be safe.
> > 
> > Question is if the 255 is enough. IMHO it needs to be maximal length
> > of the filename (in u16 unit) which can be used by the NT kernel. And
> > I'm not sure what it is right now.
> 
> I'm not sure. 255 (without \0) tends to be a common limit in practice.
> Multipying by MB_LEN_MAX produces a very excessively-sized buffer for
> UTF-8 use at least.
> 
> I don't know in which charset a code point may encode to five bytes.

I know some encodings which use more than 4 bytes (for example variant
of UTF-EBCDIC uses 5 bytes) but I do not know anything which MS can use.
I think that MS libraries were hardcoded to 3 bytes for a long time and
that was reason why UTF-8 was not possible to set as an application
encoding.

> If it requires a surrogate pair then using 5 as a multiplier is silly.

I think this needs to be investigated. The worst thing which can happen
is that somebody figures out how to change ucrt encoding to something
which is 5 bytes per character and our mingw-w64 would not be prepared
for it.

> > Now I understood what you mean. I guess that there can be DLL library
> > which takes DIR* pointer parameter. And then we have a problem.
> 
> You understood it correctly. :-)
> 
> > Maybe we would need type versioning? Like it was with time_t or fpos_t
> > which based on the compile time macro expands either to old (32-bit)
> > or new (64-bit type).
> 
> If a third party header has
> 
>     include <dirent.h>
>     int foo(DIR *d);
> 
> it's not possible to know which version of the symbols were used when
> the library was compiled. To do versioning with only header macros, all
> participants have to co-operate. Ideally one doesn't use this kind of
> data types in API/ABI at all.

Yes, that is truth. But same thing is already being done for time_t
types in both visual studio and mingw-w64 header files. There are some
defaults and via #define you change the behavior.

Also same was used for a long time by UNIX LFN which changed in this way
fpos_t and off_t types (plus redefined open, lseek and other functions).

> > > I (hopefully) fixed dirent issues in Gnulib:
> > > 
> > >     https://lists.gnu.org/archive/html/bug-gnulib/2025-01/msg00087.html
> 
> This didn't work out but I learned a few things which are useful *if*
> MinGW-w64's dirent code is changed. I try to send something for
> discussion in a few days.
> 
> -- 
> Lasse Collin


_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public

Reply via email to