Hi Thomas, hi Christian,

On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
> On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
> > On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
> > What bugs me is that we have the choice between a broken mbrtowc on
> > one side and a chance to generate broken filenames on the other side.
> I did not look into those details, but while characters to be handled by a
> terminal come sequentially as a stream, filenames can be handled as a
> compound string, isn't that easier to check?
>
> > I think we should actually revert fa272e05bbd0 ("wcstombs: also call
> > __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
> > fix the filename issue in the Cygwin functions for filename conversion
> > alone.
> >
> > Any ideas appreciated.

I think I have a fix.

I reverted fa272e05bbd0 so mbrtowc is operating as before.  This should
fix mintty.

As for the filename problem, I had another look into the _sys_wcstombs
and _sys_mbstowcs functions.  It occurred to me that the algorithm for
handling an invalid MB sequence is upside down when it comes to invalid
UTF-8 4-byte sequences.

Consider a simple broken 2-byte UTF-8 sequence like 0xc2 0x7f.  This
sequence is converted to wide chars, with the invalid byte mapped into
the private use area, like this:

  0xc2 0x7f -> 0xf0c2 0x007f

The first byte of the sequence is wrong, so it's converted to 0xf0xx.
At this point, we reset the mbstate and try the mbtowc conversion again
with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.

Also

  0xc2 0xff -> 0xf0c2 0xf0ff

because 0xc2 0xff is not valid and 0xff is not a valid lead byte.

Now consider a broken 3-byte sequence.  Same as above:

  0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f

Now the 4-byte sequence with a broken 4th byte:

  0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f

What's wrong here is the fact that the broken sequence results in a
valid high surrogate and the trailing 4th byte is treated as the broken
sequence.  But in fact the leading three bytes are the broken sequence.
The current algorithm doesn't catch that, because by the time the error
shows up, the high surrogate has already been written to the output.
So the innocent 4th byte has to take the punch.

I added a patch to _sys_mbstowcs:

- note the fact we already got a high surrogate
- if the next underlying mbtowc call returns an error, backtrack to the
  high surrogate in the output string and overwrite it with a per-byte
  sequence in the private use area
- reset mbstate
- retry the next byte after the broken sequence

As far as my testing goes, all cases with broken filenames should work
now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e will contain
the patch.

However, there's one problem left.  I added a FIXME comment to
_sys_wcstombs:

  FIXME? The conversion of invalid bytes from the private use area like
  we do here is not actually necessary.
  If we skip it, the generated multibyte string is not identical to the
  original multibyte string, but it's equivalent in the sense that
  another mbstowcs will generate the same wide-char string.  It would
  also be identical to the same string converted by wcstombs.  And while
  the original multibyte string can't be converted by mbstowcs, this
  string can.

What does that mean?  Consider this UTF-8 input string:

  0xf0 0x90 0x80 0x2e

  mbstowcs:     returns -1
  sys_mbstowcs: f0f0 f090 f080 002e

Let's convert it back to multibyte:

  sys_wcstombs: 0xf0 0x90 0x80 0x2e
  wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e

So while sys_wcstombs has special code converting the string back to its
original MB string, wcstombs converts to the CESU-8 representation.

This is transparent.  If we convert this CESU-8 string back to
wide-char, the resulting wide-char strings are the same:

  mbstowcs:     f0f0 f090 f080 002e
  sys_mbstowcs: f0f0 f090 f080 002e

So the question here is, shall we keep the special case converting
private use area bytes back to their original byte encoding?  Or shall
we simply go along with CESU-8 when converting back to multibyte to keep
the string the same as with wcstombs?

Exempt from this are the characters not valid in a DOS filename.  These
will always be converted if we create wide-char filenames.

Thanks,
Corinna
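
P.S.: For illustration only, here is a rough, self-contained sketch of the
backtracking idea in _sys_mbstowcs as described above.  It is not the
actual patch.  sketch_mbtowc, sketch_state and sketch_mbstowcs are
hypothetical names made up for this example, the decoder is a simplified
stand-in for the real UTF-8 mbtowc (it assumes 16-bit wide chars and
reports truncated input as an error), and the caller is assumed to
provide room for at least n output units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: set while a high surrogate has been
   emitted and the 4th byte of the sequence is still outstanding. */
typedef struct { uint32_t partial; int pending; } sketch_state;

static int is_cont(unsigned char c) { return (c & 0xc0) == 0x80; }

/* Simplified model of a UTF-8 -> UTF-16 mbtowc.  Returns the number of
   bytes consumed, or -1 on an invalid sequence.  For a 4-byte sequence
   it consumes 3 bytes and yields the high surrogate; the next call
   consumes the 4th byte and yields the low surrogate. */
static int sketch_mbtowc(uint16_t *out, const unsigned char *s, size_t n,
                         sketch_state *st)
{
    unsigned char b = s[0];
    if (st->pending) {
        if (!is_cont(b))
            return -1;
        uint32_t cp = (st->partial << 6) | (b & 0x3f);
        *out = (uint16_t)(0xdc00 | ((cp - 0x10000) & 0x3ff));
        st->pending = 0;
        return 1;
    }
    if (b < 0x80) { *out = b; return 1; }
    if (b >= 0xc2 && b <= 0xdf) {
        if (n < 2 || !is_cont(s[1]))
            return -1;
        *out = (uint16_t)(((b & 0x1f) << 6) | (s[1] & 0x3f));
        return 2;
    }
    if ((b & 0xf0) == 0xe0) {
        if (n < 3 || !is_cont(s[1]) || !is_cont(s[2]))
            return -1;
        uint32_t cp = ((uint32_t)(b & 0x0f) << 12)
                      | ((uint32_t)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
        if (cp < 0x800 || (cp >= 0xd800 && cp <= 0xdfff))
            return -1;
        *out = (uint16_t)cp;
        return 3;
    }
    if (b >= 0xf0 && b <= 0xf4) {
        if (n < 3 || !is_cont(s[1]) || !is_cont(s[2]))
            return -1;
        /* top == code point >> 6; valid range is 0x400..0x43ff */
        uint32_t top = ((uint32_t)(b & 0x07) << 12)
                       | ((uint32_t)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
        if (top < 0x400 || top > 0x43ff)
            return -1;
        st->partial = top;
        st->pending = 1;
        *out = (uint16_t)(0xd800 | ((top - 0x400) >> 4));
        return 3;
    }
    return -1;
}

/* Convert n bytes from src into dst (room for at least n units assumed).
   Invalid bytes become 0xf0xx.  If the byte following a high surrogate
   is invalid, the surrogate is overwritten with the per-byte form of the
   three bytes that produced it.  A sequence truncated at the very end of
   the input is deliberately not handled in this sketch. */
size_t sketch_mbstowcs(uint16_t *dst, const unsigned char *src, size_t n)
{
    sketch_state st = { 0, 0 };
    size_t i = 0, o = 0, hs_in = 0;

    while (i < n) {
        uint16_t wc = 0;
        int had_hs = st.pending;      /* note: we already got a high surrogate */
        int r = sketch_mbtowc(&wc, src + i, n - i, &st);

        if (r > 0) {
            if (st.pending)
                hs_in = i;            /* remember where the sequence started */
            dst[o++] = wc;
            i += (size_t)r;
        } else if (had_hs) {
            o--;                      /* backtrack over the high surrogate */
            for (size_t k = 0; k < 3; k++)
                dst[o++] = (uint16_t)(0xf000 | src[hs_in + k]);
            st.pending = 0;           /* reset state, then retry byte i */
        } else {
            dst[o++] = (uint16_t)(0xf000 | src[i++]);
        }
    }
    return o;
}
```

With this sketch, the input 0xf0 0x90 0x80 0x7f comes out as
f0f0 f090 f080 007f rather than d800 f07f, while a well-formed sequence
like 0xf0 0x90 0x80 0x80 still yields the surrogate pair d800 dc00.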