I like the idea to avoid LoadLibrary at all, I wasn't sure how to achieve it. Thanks for the info about using DATA in def files.
The bug itself seems to be around since msvcr80.dll, where mbrtowc and other C95 conversion functions were introduced. I guess it was unnoticed because it only occurs with specific code pages and in very uncommon case when you pass 1 as `count` argument instead of MB_CUR_MAX. I can't figure myself how to report a bug in CRT to Microsoft, any pointers? If they would fix this in UCRT, this would be amazing. Using escape sequenced instead of literal UTF-8 characters sounds reasonable. In this case I could also replace them in existing tests for conversion functions. - Kirill Makurin ________________________________ From: Martin Storsjö <[email protected]> Sent: Wednesday, August 20, 2025 9:49 PM To: mingw-w64-public <[email protected]> Subject: Re: [Mingw-w64-public] Fix return value of mbrlen and mbrtowc Hi, Side note; patches as attachments are much harder to comment on, than inline patches, with my mail client. Comments on patch 3: > crt: use replacements for C95 conversion functions with UCRT The commit message states that it does this, but it does not say _why_. This is essential information. It may have been conveyed elsewhere (prior discussions on the mailing list, or maybe in prior commit messages as well), but at least for me for jumping in to review it, I don't have that information available. Overall - with older msvcrt.dll, we have had the habit to replace lots of functions with our own, statically linked. With UCRT, the general direction is to work towards not replacing as many CRT functions, if possible. In this case, as far as I've understood, there's a spec bug in the UCRT implementation of this bug. I would like this bug to be outlined in the commit message here. Ideally it would also come with a link to a bugreport to MS about this bug (because they actually do fix these kinds of bugs), and a note about which version of UCRT the bug was observed in. Whether to replace the function or not is of course always a judgement call between the severity of bug and how widespread the usage is. In this case I guess the function isn't that widely used, so not all users would end up statically linking more code, so it may be ok in that sense. Some practical comments on the patch: > +#ifdef _UCRT > +static > +#endif > size_t wcrtomb ( Cases like these are kinda ugly and repetitive. It could be nicer to define a macro like STATIC_ON_UCRT at the start of the file, to avoid some ifdefs. > +#ifdef _UCRT > +#undef wcrtomb > +#undef wcsrtombs > +#endif I don't think we need to undef defines at the end of the file. > +static size_t wcrtomb_init ( > + char *__restrict__ mbc, > + wchar_t wc, > + mbstate_t *__restrict__ state > +) { > + HANDLE ucrt = LoadLibraryW (L"ucrtbase.dll"); > + > + if (ucrt == NULL) { > + fwprintf (stderr, L"Runtime error: failed to obtain handle to > ucrtbase.dll\n"); > + abort (); > + } > + > + __wcrtomb_t real_wcrtomb = (__wcrtomb_t) GetProcAddress (ucrt, "wcrtomb"); Instead of LoadLibrary, we could also consider GetModuleHandle, as the DLL should be expected to already be loaded here. However - this approach doesn't work on UWP. In UWP environments, the regular LoadLibrary and GetModuleHandle are unavailable. (If compiling for such environments, one probably ends up linking with the winstorecompat library, which replaces them with stubs.) Or when looking closer at it, LoadLibrary is being stubbed in both winstorecompat (for Win 8.x "windows store") and windowsappcompat (for Win 10 UWP), while GetModuleHandle is stubbed only in the former, so it may be usable on Win10. But anyway, if we could avoid the need to do this dynamic lookup if possible, it would be much less brittle. In this case, we _know_ the DLL we link against does have these functions, so we don't need to avoid it for the sake of not having a binary loading (like in the case of msvcrt.dll). An alternative approach, is to not comment out the original functions in the .def file, but mark them as "DATA" - which retains the __imp_<func> symbols but skips the <func> symbol. Then we can reference the real ones through __imp_<func>, or rather __MINGW_IMP_SYMBOL(<func>), in code, without requiring the dynamic lookup at all. If callers end up calling e.g. wcrtomb with a dllimport declaration (our headers don't declare it this way right now), this would end up going directly towards the UCRT version though, so it's less complete, but I think it still would be a simpler solution overall. What do you think? For patch 4: > + * Valid UTF-8 character which cannot be represented by a single wchar_t. > + */ > +char UTF8SurrogatePair[] = "🧡"; > #endif I think we'd rather have the fancy utf8 char in the form of escapes, rather than as a literal utf8 char. // Martin _______________________________________________ Mingw-w64-public mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/mingw-w64-public _______________________________________________ Mingw-w64-public mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
