Re: readdir() returns inaccessible name if file was created with invalid UTF-8

Brian Inglis via Cygwin Thu, 24 Jul 2025 08:29:28 -0700

On 2025-07-24 04:30, Corinna Vinschen via Cygwin wrote:

Hi Thomas, hi Christian,


On Jul 23 17:50, Thomas Wolff via Cygwin wrote:

On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:

On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
What bugs me is that we have the choice between a broken mbrtowc on
one side and a chance to generate broken filenames on the other side.

I did not look into those details, but while characters to be handled by a
terminal come sequentially as a stream, filenames can be handled as a
compound string, isn't that easier to check?

I think we should actually revert fa272e05bbd0 ("wcstombs: also call
__WCTOMB on terminating NUL if output buffer is NULL") and see if we can
fix the filename issue in the Cygwin functions for filename conversion
alone.

Any ideas appreciated.


I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
as before.  This should fix mintty.

As for the filename problem, I had another look into the _sys_wcstombs
and _sys_mbstowcs functions.

It occurred to me that the algorithm for handling an invalid MB sequence
is upside down when it comes to invalid 4-byte UTF-8 sequences.

Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f.  This
sequence is converted to a byte sequence in the private use area like this:

   0xc2 0x7f -> 0xf0c2 0x007f

So the first byte of the sequence is wrong, so it's converted to 0xf0xx.
At this point, we reset the mbstate and try the mbtowc conversion again
with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
Also

   0xc2 0xff -> 0xf0c2 0xf0ff

because 0xc2 0xff is not valid and 0xff is not a valid lead byte.

Now consider a broken 3 byte sequence.  Same as above:

   0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f
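
The per-byte mapping described for the 2-byte and 3-byte cases can be
emulated outside of Cygwin. The following is only a minimal Python sketch,
not the Cygwin code: the handler name "pua_per_byte" and the helper
to_wide() are made up for illustration, and Python's UTF-8 decoder stands in
for the underlying mbtowc call. On an error it maps just the first offending
byte to 0xf0XX and retries at the next byte.

```python
import codecs

PUA_BASE = 0xF000  # an invalid byte 0xXX is represented as wide char 0xF0XX


def _byte_to_pua(exc):
    """Map only the first offending byte, then resume at the very next byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start]
    return chr(PUA_BASE + bad), exc.start + 1


# "pua_per_byte" is a hypothetical handler name used only in this sketch.
codecs.register_error("pua_per_byte", _byte_to_pua)


def to_wide(data: bytes) -> list[str]:
    """Decode UTF-8 and return the resulting characters as hex strings."""
    text = data.decode("utf-8", errors="pua_per_byte")
    return [f"{ord(ch):04x}" for ch in text]


print(to_wide(b"\xc2\x7f"))
print(to_wide(b"\xc2\xff"))
print(to_wide(b"\xe0\xa0\x7f"))
```

Because the decoder here works on whole sequences, it never emits a lone
high surrogate, so it does not reproduce the 4-byte problem discussed next.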

Now the 4 byte sequence with a broken 4th byte:

   0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f

What's wrong here is the fact that the broken sequence results in
a valid high surrogate and the trailing 4th byte is treated as the
broken sequence.

But in fact the leading three bytes are the broken sequence.  The
current algorithm doesn't catch that, because it's already done
and handled.  So the innocent 4th byte has to take the punch.

I added a patch to _sys_mbstowcs:
- note the fact we already got a high surrogate
- if the next underlying mbtowc call returns an error, backtrack
   to the high surrogate in the output string and overwrite it with
   a per-byte sequence in the private use area
- reset mbstate
- retry the next byte after the broken sequence
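
The steps above can be sketched roughly as follows. This is a toy Python
model under stated assumptions, not the actual _sys_mbstowcs patch: the
function name mbstowcs_sketch() and the backtrack flag are invented for
illustration, and the "old" branch only mimics the behaviour described in
this mail (high surrogate kept, 4th byte blamed).

```python
PUA = 0xF000  # an invalid byte 0xXX is stored as wide char 0xF0XX


def _seq_len(lead):
    """Expected UTF-8 sequence length for a lead byte, 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _second_range(lead):
    """Allowed range of the second byte (rejects overlongs and surrogates)."""
    return {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
            0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}.get(lead, (0x80, 0xBF))


def mbstowcs_sketch(data, backtrack=True):
    """Convert bytes to UTF-16 code units, mapping invalid bytes into the PUA."""
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        n = _seq_len(lead)
        good = 1 if n else 0  # bytes of this sequence accepted so far
        while 0 < good < n and i + good < len(data):
            lo, hi = _second_range(lead) if good == 1 else (0x80, 0xBF)
            if not lo <= data[i + good] <= hi:
                break
            good += 1
        if n and good == n:  # complete, valid sequence
            cp = lead if n == 1 else lead & (0xFF >> (n + 1))
            for b in data[i + 1:i + n]:
                cp = (cp << 6) | (b & 0x3F)
            if cp >= 0x10000:
                cp -= 0x10000
                out += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
            else:
                out.append(cp)
            i += n
        elif n == 4 and good == 3:
            # Three good bytes: the high surrogate is emitted before byte 4.
            partial = (((lead & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12)
                       | ((data[i + 2] & 0x3F) << 6))
            out.append(0xD800 | ((partial - 0x10000) >> 10))
            if backtrack:
                out.pop()                                # drop the surrogate
                out += [PUA | b for b in data[i:i + 3]]  # per-byte PUA values
                i += 3                                   # retry at byte 4
            elif i + 3 < len(data):
                out.append(PUA | data[i + 3])            # old: blame byte 4
                i += 4
            else:
                i += 3
        else:
            out.append(PUA | lead)  # any other invalid byte
            i += 1
    return out


print([f"{u:04x}" for u in mbstowcs_sketch(b"\xf0\x90\x80\x7f", backtrack=False)])
print([f"{u:04x}" for u in mbstowcs_sketch(b"\xf0\x90\x80\x7f")])
```

With backtrack=False the model yields the d800 f07f result shown earlier;
with backtrack=True the three leading bytes become per-byte private use
values and the 4th byte is converted on its own.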

As far as my testing goes, all cases with broken filenames should
work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
will contain the patch.

However, there's one problem left.  I added a FIXME comment to
_sys_wcstombs:

    FIXME? The conversion of invalid bytes from the private use area
    like we do here is not actually necessary.  If we skip it, the
    generated multibyte string is not identical to the original multibyte
    string, but it's equivalent in the sense, that another mbstowcs will
    generate the same wide-char string.  It would also be identical to
    the same string converted by wcstombs.  And while the original
    multibyte string can't be converted by mbstowcs, this string can.

What does that mean?  Consider this UTF8 input string:

   0xf0 0x90 0x80 0x2e

   mbstowcs:     returns -1
   sys_mbstowcs: f0f0 f090 f080 002e

Let's convert it back to multibyte:

   sys_wcstombs: 0xf0 0x90 0x80 0x2e
   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e

So while sys_wcstombs has special code converting the string back to its
original MB string, wcstombs converts to the CESU-8 representation.

This is transparent.  If we convert this CESU-8 string back to
wide-char, the resulting wide-char strings are the same:

   mbstowcs:     f0f0 f090 f080 002e
   sys_mbstowcs: f0f0 f090 f080 002e
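
The round trip above can be checked with a small Python sketch. The two
function names are placeholders for illustration, and the assumption that
every value in 0xf000..0xf0ff is turned back into a single byte is a
simplification; the real _sys_wcstombs may be more selective.

```python
PUA = 0xF000  # an invalid byte 0xXX is stored as wide char 0xF0XX


def wcstombs_sketch(wide: str) -> bytes:
    """Plain conversion: PUA characters are encoded like any other character."""
    return wide.encode("utf-8")


def sys_wcstombs_sketch(wide: str) -> bytes:
    """Special case: U+F0XX is turned back into the single byte 0xXX."""
    out = bytearray()
    for ch in wide:
        cp = ord(ch)
        if PUA <= cp <= PUA + 0xFF:
            out.append(cp - PUA)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


wide = "\uf0f0\uf090\uf080\u002e"

print(sys_wcstombs_sketch(wide).hex(" "))
print(wcstombs_sketch(wide).hex(" "))

# Converting the longer string back gives the same wide-char string.
print(wcstombs_sketch(wide).decode("utf-8") == wide)
```

The first result is the original, invalid byte string; the second is the
longer, valid one that a plain mbstowcs can read back.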

So the question here is, shall we keep the special case converting
private use area bytes back to their original byte encoding?

Or shall we simply go along with CESU-8 when converting back to multibyte
to keep the string the same as with wcstombs?

There are 15 times as many SMP as BMP characters, so many non-Western and
emoji characters will be expanded from 4 UTF-8 bytes to 6 CESU-8 bytes, and
CESU-8 is not supported anywhere as a string representation; per the TR it
is designed for internal use only.
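
To make the size difference concrete, here is a minimal Python sketch.
cesu8_encode() is a hypothetical helper written for this illustration, since
Python has no built-in CESU-8 codec; it encodes each UTF-16 code unit
separately, which is what turns a supplementary character into two 3-byte
sequences.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode each UTF-16 code unit on its own as a UTF-8 style sequence."""
    raw = text.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # "surrogatepass" lets the surrogate halves be encoded individually.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


emoji = "\U0001F600"  # a character outside the BMP

print(len(emoji.encode("utf-8")), emoji.encode("utf-8").hex(" "))
print(len(cesu8_encode(emoji)), cesu8_encode(emoji).hex(" "))
```

BMP characters come out identical in both encodings; only characters
outside the BMP grow from 4 to 6 bytes.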

Exempt from this are the characters not valid in a DOS filename.
These will always be converted if we create wide-char filenames.

--
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher  but when there is no more to cut
                                -- Antoine de Saint-Exupéry

--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
