Re: readdir() returns inaccessible name if file was created with invalid UTF-8

Corinna Vinschen via Cygwin Thu, 24 Jul 2025 03:31:53 -0700

Hi Thomas, hi Christian,

On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
> Am 23.07.2025 um 09:53 schrieb Corinna Vinschen via Cygwin:
> > On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
> > What bugs me is that we have the choice between a broken mbrtowc on
> > one side and a chance to generate broken filenames on the other side.
> I did not look into those details, but while characters to be handled by a
> terminal come sequentially as a stream, filenames can be handled as a
> compound string, isn't that easier to check?
> 
> > I think we should actually revert fa272e05bbd0 ("wcstombs: also call
> > __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
> > fix the filename issue in the Cygwin functions for filename conversion
> > alone.
> > 
> > Any ideas appreciated.


I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
as before.  This should fix mintty.

As for the filename problem, I had another look into the _sys_wcstombs
and _sys_mbstowcs functions.

It occurred to me that the algorithm for handling an invalid MB sequence
is upside down when it comes to invalid 4-byte UTF8 sequences.

Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f.  This 
sequence is converted to a byte sequence in the private use area like this:

  0xc2 0x7f -> 0xf0c2 0x007f

The first byte of the sequence is wrong, so it's converted to 0xf0xx.
At this point, we reset the mbstate and try the mbtowc conversion again
with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
Also

  0xc2 0xff -> 0xf0c2 0xf0ff

because 0xc2 0xff is not valid and 0xff is not a valid lead byte.

Now consider a broken 3 byte sequence.  Same as above:

  0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f

Now the 4 byte sequence with a broken 4th byte:

  0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f

What's wrong here is the fact that the broken sequence results in
a valid high surrogate and the trailing 4th byte is treated as the
broken sequence.

But in fact the leading three bytes are the broken sequence.  The
current algorithm doesn't catch that, because it's already done
and handled.  So the innocent 4th byte has to take the punch.

I added a patch to _sys_mbstowcs:
- note the fact we already got a high surrogate
- if the next underlying mbtowc call returns an error, backtrack
  to the high surrogate in the output string and overwrite it with
  a per-byte sequence in the private use area
- reset mbstate
- retry the next byte after the broken sequence
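
The steps above can be sketched roughly as follows.  This is a
simplified, self-contained illustration and not the actual patch:
sketch_mbstowcs and seq_len are hypothetical names, the output is plain
UTF-16 code units, the mbstate reset is implicit because the decoder
state is local to each loop iteration, and the overlong and surrogate
checks for 2- and 3-byte forms are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a lead byte, 0 if it cannot start one. */
static int
seq_len (unsigned char b)
{
  if (b < 0x80) return 1;
  if (b >= 0xc2 && b <= 0xdf) return 2;
  if (b >= 0xe0 && b <= 0xef) return 3;
  if (b >= 0xf0 && b <= 0xf4) return 4;
  return 0;
}

/* Convert n bytes to UTF-16; dst must have room for n code units.
   Returns the number of code units written. */
static size_t
sketch_mbstowcs (uint16_t *dst, const unsigned char *src, size_t n)
{
  size_t out = 0, i = 0;

  assert (dst != NULL && src != NULL);
  while (i < n)
    {
      int len = seq_len (src[i]);
      int k, have_high = 0;
      uint32_t cp;

      if (len <= 1)
        {
          /* ASCII is copied, an invalid lead byte goes to the PUA. */
          dst[out++] = len ? src[i] : (uint16_t) (0xf000 | src[i]);
          i++;
          continue;
        }
      cp = src[i] & (0xff >> (len + 1));
      for (k = 1; k < len; k++)
        {
          if (i + k >= n || (src[i + k] & 0xc0) != 0x80)
            break;
          cp = (cp << 6) | (src[i + k] & 0x3f);
          if (len == 4 && k == 2)
            {
              if (cp < 0x400 || cp > 0x43ff)
                break;          /* outside U+10000..U+10FFFF */
              /* Three bytes in: the high surrogate is emitted early.
                 Note the fact so we can undo it later. */
              dst[out++] = (uint16_t) (0xd7c0 + (cp >> 4));
              have_high = 1;
            }
        }
      if (k == len)
        {
          /* Complete sequence. */
          dst[out++] = len == 4 ? (uint16_t) (0xdc00 | (cp & 0x3ff))
                                : (uint16_t) cp;
          i += len;
        }
      else if (have_high)
        {
          /* Backtrack: overwrite the high surrogate with the three
             leading bytes in per-byte PUA form. */
          out--;
          for (k = 0; k < 3; k++)
            dst[out++] = (uint16_t) (0xf000 | src[i + k]);
          i += 3;               /* retry the byte after the broken part */
        }
      else
        {
          dst[out++] = (uint16_t) (0xf000 | src[i]);
          i++;                  /* retry with byte 2 */
        }
    }
  return out;
}
```

In this sketch the input 0xf0 0x90 0x80 0x2e comes out as
f0f0 f090 f080 002e, and 0xe0 0xa0 0x7f as f0e0 f0a0 007f.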

As far as my testing goes, all cases with broken filenames should
work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
will contain the patch.

However, there's one problem left.  I added a FIXME comment to
_sys_wcstombs:

   FIXME? The conversion of invalid bytes from the private use area
   like we do here is not actually necessary.  If we skip it, the
   generated multibyte string is not identical to the original multibyte
   string, but it's equivalent in the sense, that another mbstowcs will
   generate the same wide-char string.  It would also be identical to
   the same string converted by wcstombs.  And while the original
   multibyte string can't be converted by mbstowcs, this string can.

What does that mean?  Consider this UTF8 input string:

  0xf0 0x90 0x80 0x2e

  mbstowcs:     returns -1
  sys_mbstowcs: f0f0 f090 f080 002e

Let's convert it back to multibyte:

  sys_wcstombs: 0xf0 0x90 0x80 0x2e
  wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e

So while sys_wcstombs has special code converting the string back to its
original MB string, wcstombs converts to the CESU-8 representation.
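
The difference between the two results can be reproduced with a small
sketch.  The names sketch_wcstombs and utf8_from_bmp and the restore
flag are made up for this illustration, and only BMP code units without
surrogates are handled.  With restore set, units in 0xf000..0xf0ff are
written back as their original byte.  Without it, each unit is encoded
in its ordinary three-byte form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code unit (surrogates not handled) as UTF-8.
   Returns the number of bytes written to out. */
static size_t
utf8_from_bmp (uint16_t wc, unsigned char *out)
{
  if (wc < 0x80)
    {
      out[0] = (unsigned char) wc;
      return 1;
    }
  if (wc < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (wc >> 6));
      out[1] = (unsigned char) (0x80 | (wc & 0x3f));
      return 2;
    }
  out[0] = (unsigned char) (0xe0 | (wc >> 12));
  out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3f));
  out[2] = (unsigned char) (0x80 | (wc & 0x3f));
  return 3;
}

/* Convert n code units to multibyte; dst needs room for 3 * n bytes.
   restore != 0 selects the special case for 0xf0xx units. */
static size_t
sketch_wcstombs (unsigned char *dst, const uint16_t *src, size_t n,
                 int restore)
{
  size_t out = 0, i;

  assert (dst != NULL && src != NULL);
  for (i = 0; i < n; i++)
    if (restore && (src[i] & 0xff00) == 0xf000)
      dst[out++] = (unsigned char) (src[i] & 0xff);   /* original byte */
    else
      out += utf8_from_bmp (src[i], dst + out);
  return out;
}
```

For f0f0 f090 f080 002e this yields the 4 original bytes with restore
set and the 10-byte string shown above without it.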

This is transparent.  If we convert this CESU-8 string back to
wide-char, the resulting wide-char strings are the same:

  mbstowcs:     f0f0 f090 f080 002e
  sys_mbstowcs: f0f0 f090 f080 002e

So the question here is, shall we keep the special case converting
private use area bytes back to their original byte encoding?

Or shall we simply go along with CESU-8 when converting back to multibyte
to keep the string the same as with wcstombs?

Exempt from this are the characters not valid in a DOS filename.
These will always be converted if we create wide-char filenames.


Thanks,
Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
