On 2025-07-24 04:30, Corinna Vinschen via Cygwin wrote:
Hi Thomas, hi Christian,
On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
What bugs me is that we have the choice between a broken mbrtowc on
one side and a chance to generate broken filenames on the other side.
I did not look into those details, but while characters to be handled by a
terminal arrive sequentially as a stream, filenames can be handled as a
complete string. Isn't that easier to check?
I think we should actually revert fa272e05bbd0 ("wcstombs: also call
__WCTOMB on terminating NUL if output buffer is NULL") and see if we can
fix the filename issue in the Cygwin functions for filename conversion
alone.
Any ideas appreciated.
I think I have a fix. I reverted fa272e05bbd0 so mbrtowc is operating
as before. This should fix mintty.
As for the filename problem, I had another look into the _sys_wcstombs
and _sys_mbstowcs functions.
It occurred to me that the algorithm for handling an invalid MB sequence
is upside down when it comes to invalid 4-byte UTF-8 sequences.
Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f. This
sequence is converted to a byte sequence in the private use area like this:
0xc2 0x7f -> 0xf0c2 0x007f
The first byte of the sequence is wrong, so it's converted to 0xf0xx.
At this point, we reset the mbstate and try the mbtowc conversion again
with byte 2. Byte 2 is now a valid single byte. Hence 0xf0c2 0x007f.
Also
0xc2 0xff -> 0xf0c2 0xf0ff
because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
Now consider a broken 3 byte sequence. Same as above:
0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f
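The per-byte fallback for the 2-byte and 3-byte cases above can be sketched roughly as follows. This is only an illustration, not the actual _sys_mbstowcs code: the names utf8_seq_len and sketch_mbstowcs are made up, the decoder is hand-rolled instead of calling mbtowc, and 4-byte sequences are deliberately left out (every byte of one is treated as invalid here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of a well-formed UTF-8 sequence of at most
   3 bytes starting at s (n bytes available), or 0 if it is malformed.
   4-byte sequences are not handled in this sketch. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;
    if (s[0] >= 0xc2 && s[0] <= 0xdf)
        return (n >= 2 && (s[1] & 0xc0) == 0x80) ? 2 : 0;
    if (s[0] >= 0xe0 && s[0] <= 0xef) {
        if (n < 3 || (s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
            return 0;
        if (s[0] == 0xe0 && s[1] < 0xa0)
            return 0;                       /* overlong encoding */
        if (s[0] == 0xed && s[1] > 0x9f)
            return 0;                       /* encoded surrogate */
        return 3;
    }
    return 0;                               /* stray continuation byte etc. */
}

/* Convert n bytes to 16-bit units. An invalid byte becomes 0xf000 | byte,
   then decoding restarts at the following byte. out needs room for n units.
   Returns the number of units written. */
size_t sketch_mbstowcs(const unsigned char *s, size_t n, uint16_t *out)
{
    size_t i = 0, o = 0;

    assert(s != NULL && out != NULL);
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);

        if (len == 0) {
            out[o++] = (uint16_t)(0xf000u | s[i]);   /* private use area */
            i += 1;
        } else if (len == 1) {
            out[o++] = s[i];
            i += 1;
        } else if (len == 2) {
            out[o++] = (uint16_t)(((s[i] & 0x1fu) << 6) | (s[i + 1] & 0x3fu));
            i += 2;
        } else {
            out[o++] = (uint16_t)(((s[i] & 0x0fu) << 12)
                                  | ((s[i + 1] & 0x3fu) << 6)
                                  | (s[i + 2] & 0x3fu));
            i += 3;
        }
    }
    return o;
}
```

Feeding it 0xc2 0x7f, 0xc2 0xff and 0xe0 0xa0 0x7f gives the three results listed above.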
Now the 4 byte sequence with a broken 4th byte:
0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
What's wrong here is the fact that the broken sequence results in
a valid high surrogate and the trailing 4th byte is treated as the
broken sequence.
But in fact the leading three bytes are the broken sequence. The
current algorithm doesn't catch that, because it's already done
and handled. So the innocent 4th byte has to take the punch.
I added a patch to _sys_mbstowcs:
- note the fact we already got a high surrogate
- if the next underlying mbtowc call returns an error, backtrack
to the high surrogate in the output string and overwrite it with
a per-byte sequence in the private use area
- reset mbstate
- retry the next byte after the broken sequence
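
The steps above can be sketched like this. It is a simplified model, not the actual patch: sketch_convert, is_4byte_prefix and the backtrack flag are hypothetical names, the decoder only knows ASCII and 4-byte sequences (any other byte is treated as invalid), and a sequence truncated by the end of input is assumed to count as broken as well. With backtrack == 0 it models the old behaviour, with backtrack != 0 the new one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PUA(b) ((uint16_t)(0xf000u | (b)))  /* invalid byte -> 0xf0xx */

static int is_cont(unsigned char b) { return (b & 0xc0) == 0x80; }

/* Nonzero if s[0..2] are valid first three bytes of a 4-byte sequence. */
static int is_4byte_prefix(const unsigned char *s, size_t n)
{
    if (n < 3 || s[0] < 0xf0 || s[0] > 0xf4)
        return 0;
    if (!is_cont(s[1]) || !is_cont(s[2]))
        return 0;
    if (s[0] == 0xf0 && s[1] < 0x90)
        return 0;                           /* overlong encoding */
    if (s[0] == 0xf4 && s[1] > 0x8f)
        return 0;                           /* beyond U+10FFFF */
    return 1;
}

/* out needs room for n units. Returns the number of units written. */
size_t sketch_convert(const unsigned char *s, size_t n, uint16_t *out,
                      int backtrack)
{
    size_t i = 0, o = 0;

    assert(s != NULL && out != NULL);
    while (i < n) {
        if (s[i] < 0x80) {                  /* plain ASCII */
            out[o++] = s[i++];
            continue;
        }
        if (!is_4byte_prefix(s + i, n - i)) {
            out[o++] = PUA(s[i++]);         /* "invalid" in this sketch */
            continue;
        }
        /* Three bytes are enough to determine the high surrogate.
           Note where it lands in the output. */
        size_t hi_pos = o;
        unsigned top = ((s[i] & 0x07u) << 8) | ((s[i + 1] & 0x3fu) << 2)
                       | ((s[i + 2] >> 4) & 0x03u);
        out[o++] = (uint16_t)(0xd800u + (top - 0x40u));

        if (i + 3 < n && is_cont(s[i + 3])) {
            /* Valid 4th byte: emit the low surrogate. */
            out[o++] = (uint16_t)(0xdc00u | ((s[i + 2] & 0x0fu) << 6)
                                  | (s[i + 3] & 0x3fu));
            i += 4;
        } else if (backtrack) {
            /* New: overwrite the high surrogate with per-byte PUA values,
               then retry the byte after the broken sequence from scratch. */
            o = hi_pos;
            out[o++] = PUA(s[i]);
            out[o++] = PUA(s[i + 1]);
            out[o++] = PUA(s[i + 2]);
            i += 3;
        } else {
            /* Old: the high surrogate stays, the 4th byte takes the punch. */
            i += 3;
            if (i < n)
                out[o++] = PUA(s[i++]);
        }
    }
    return o;
}
```

For 0xf0 0x90 0x80 0x7f this yields 0xd800 0xf07f without backtracking and 0xf0f0 0xf090 0xf080 0x007f with it. A valid sequence such as 0xf0 0x9f 0x98 0x80 still produces the surrogate pair 0xd83d 0xde00 either way.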
As far as my testing goes, all cases with broken filenames should
work now. The upcoming test release 3.7.0-0.261.gf21fbcaf583e
will contain the patch.
However, there's one problem left. I added a FIXME comment to
_sys_wcstombs:
FIXME? The conversion of invalid bytes from the private use area
like we do here is not actually necessary. If we skip it, the
generated multibyte string is not identical to the original multibyte
string, but it's equivalent in the sense, that another mbstowcs will
generate the same wide-char string. It would also be identical to
the same string converted by wcstombs. And while the original
multibyte string can't be converted by mbstowcs, this string can.
What does that mean? Consider this UTF8 input string:
0xf0 0x90 0x80 0x2e
mbstowcs: returns -1
sys_mbstowcs: f0f0 f090 f080 002e
Let's convert it back to multibyte:
sys_wcstombs: 0xf0 0x90 0x80 0x2e
wcstombs: 0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
So while sys_wcstombs has special code converting the string back to its
original MB string, wcstombs converts to the CESU-8 representation.
This is transparent. If we convert this CESU-8 string back to
wide-char, the resulting wide-char strings are the same:
mbstowcs: f0f0 f090 f080 002e
sys_mbstowcs: f0f0 f090 f080 002e
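
To make the two back-conversions easier to compare, here is a rough sketch of both in one function. It is not the real _sys_wcstombs: sketch_wcstombs and the restore_bytes flag are invented for illustration, only BMP code units are handled (no surrogate pairs), and treating exactly the range 0xf080..0xf0ff as escaped invalid bytes is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n 16-bit units to bytes. out needs room for 3 * n bytes.
   restore_bytes != 0: units 0xf080..0xf0ff become their original byte
                       (the special case in question).
   restore_bytes == 0: every unit is encoded as an ordinary 1- to 3-byte
                       sequence, the way plain wcstombs would do it.
   Returns the number of bytes written. */
size_t sketch_wcstombs(const uint16_t *w, size_t n, unsigned char *out,
                       int restore_bytes)
{
    size_t o = 0;

    assert(w != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t c = w[i];

        if (restore_bytes && c >= 0xf080 && c <= 0xf0ff) {
            out[o++] = (unsigned char)(c & 0xff);       /* original byte */
        } else if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xc0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            out[o++] = (unsigned char)(0xe0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            out[o++] = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return o;
}
```

With the wide-char string f0f0 f090 f080 002e, restore_bytes != 0 gives back the 4 original bytes, while restore_bytes == 0 gives the 10-byte string shown for wcstombs above.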
So the question here is, shall we keep the special case converting
private use area bytes back to their original byte encoding?
Or shall we simply go along with CESU-8 when converting back to multibyte
to keep the string the same as with wcstombs?
There are 15 times as many supplementary-plane characters as BMP characters, so many
non-Western and emoji characters would be expanded from 4 UTF-8 bytes to 6 CESU-8 bytes.
CESU-8 is not supported anywhere as a string representation; per the TR it is designed
for internal use only.
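
A small sketch of the 4-byte versus 6-byte difference for a single supplementary code point, assuming the usual definitions (UTF-8 encodes the code point directly, CESU-8 encodes each half of the UTF-16 surrogate pair as its own 3-byte sequence). The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one supplementary code point (U+10000..U+10FFFF) as UTF-8.
   Writes 4 bytes to out and returns 4. */
size_t sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL && cp >= 0x10000 && cp <= 0x10ffff);
    out[0] = (unsigned char)(0xf0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char)(0x80 | (cp & 0x3f));
    return 4;
}

/* Encode the same code point as CESU-8: split it into a surrogate pair,
   then write each surrogate as a 3-byte sequence. Returns 6. */
size_t sketch_cesu8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL && cp >= 0x10000 && cp <= 0x10ffff);
    uint32_t v = cp - 0x10000;
    uint32_t sur[2];
    size_t o = 0;

    sur[0] = 0xd800 | (v >> 10);        /* high surrogate */
    sur[1] = 0xdc00 | (v & 0x3ff);      /* low surrogate */
    for (int k = 0; k < 2; k++) {
        out[o++] = (unsigned char)(0xe0 | (sur[k] >> 12));
        out[o++] = (unsigned char)(0x80 | ((sur[k] >> 6) & 0x3f));
        out[o++] = (unsigned char)(0x80 | (sur[k] & 0x3f));
    }
    return o;
}
```

For U+1F600 the first function writes f0 9f 98 80 and the second writes ed a0 bd ed b8 80.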
Exempt from this are the characters not valid in a DOS filename.
These will always be converted if we create wide-char filenames.
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher but when there is no more to cut
-- Antoine de Saint-Exupéry