On 24.07.2025 at 16:08, Corinna Vinschen wrote:
On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
On 24.07.2025 at 12:30, Corinna Vinschen wrote:
What does that mean?  Consider this UTF8 input string:

    0xf0 0x90 0x80 0x2e

    mbstowcs:     returns -1
    sys_mbstowcs: f0f0 f090 f080 002e

Let's convert it back to multibyte:

    sys_wcstombs: 0xf0 0x90 0x80 0x2e
    wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e

So while sys_wcstombs has special code converting the string back to its
original MB string, wcstombs converts to the CESU-8 representation.
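To make the byte values above easier to check, here is a minimal sketch in Python, not the Cygwin code itself. It assumes, as the dump suggests, that each byte of an invalid sequence is mapped to the private use code point 0xf000 + byte; the handler name `pua_escape` and the helpers `to_wide` and `to_mb_plain` are made up for this illustration.

```python
import codecs

# Assumption: an undecodable byte b becomes the code point U+F000 + b,
# matching the f0f0 f090 f080 values shown in the dump above.
PUA_BASE = 0xF000

def _pua_escape(exc):
    """Decode error handler: map every offending byte to U+F000 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end

codecs.register_error("pua_escape", _pua_escape)

def to_wide(mb: bytes) -> str:
    """Hypothetical stand-in for the lenient multibyte-to-wide conversion."""
    return mb.decode("utf-8", errors="pua_escape")

def to_mb_plain(wide: str) -> bytes:
    """Plain UTF-8 encoding, with no special case for U+F0xx."""
    return wide.encode("utf-8")

wide = to_wide(bytes([0xF0, 0x90, 0x80, 0x2E]))
print(" ".join(f"{ord(c):04x}" for c in wide))
print(" ".join(f"0x{b:02x}" for b in to_mb_plain(wide)))
```

The second line printed is the ten-byte sequence listed for wcstombs: each private use code point takes three bytes in UTF-8, and the trailing 0x2e stays a single byte.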

This is transparent.  If we convert this CESU-8 string back to
wide-char, the resulting wide-char strings are the same:

    mbstowcs:     f0f0 f090 f080 002e
    sys_mbstowcs: f0f0 f090 f080 002e
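The round trip can be checked independently of any Cygwin code. This small Python sketch only uses the standard strict UTF-8 decoder on the ten bytes from the wcstombs line; the variable names are arbitrary.

```python
# The ten bytes produced by wcstombs in the example above.
encoded = bytes.fromhex("ef83b0ef8290ef82802e")

# Strict decode: these bytes are valid UTF-8, so no error handling is needed.
wide = encoded.decode("utf-8")

print(" ".join(f"{ord(c):04x}" for c in wide))  # f0f0 f090 f080 002e
```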

So the question here is, shall we keep the special case converting
private use area bytes back to their original byte encoding?

Or shall we simply go along with CESU-8 when converting back to multibyte
to keep the string the same as with wcstombs?

Exempt from this are the characters not valid in a DOS filename.
These will always be converted if we create wide-char filenames.
Sounds like a fair solution with only minor glitches. Poor 4th byte but
thanks a lot anyway.
About the latter decision, if there's no strong bias otherwise, I'd prefer
to drop special handling (but don't take my vote, I don't care so much about
that).
Thanks for your input.

As another datapoint we have to consider how sys_wcstombs is used.

wcstombs on a filename will be used by the application only, and only if
the filename is incoming application level data or has been converted to a
wide char by the application itself.

sys_wcstombs will be used to generate a readable multi-byte filename from
UTF-16 filenames read from the filesystem.  So its major use in terms of
filenames is by readdir().

Knowing that, the question boils down to this:

Do we want readdir() returning the same name as given to open(), or is
CESU-8 sufficient?
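The two options can be compared side by side. This is only a sketch under assumptions: `readdir_restoring` and `readdir_plain` are hypothetical names for the two behaviours under discussion, and the restored range U+F080 to U+F0FF is a guess made for the illustration, not a statement about what sys_wcstombs actually checks.

```python
PUA_BASE = 0xF000  # assumption: invalid byte b is stored as U+F000 + b

def readdir_restoring(wide_name: str) -> bytes:
    """Option 1: turn U+F080..U+F0FF back into the original single byte."""
    out = bytearray()
    for ch in wide_name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

def readdir_plain(wide_name: str) -> bytes:
    """Option 2: plain UTF-8 encoding, the same result as wcstombs above."""
    return wide_name.encode("utf-8")

opened_as = bytes([0xF0, 0x90, 0x80, 0x2E])   # name handed to open()
on_disk = "\uf0f0\uf090\uf080."               # wide-char name on the filesystem

same_with_restoring = readdir_restoring(on_disk) == opened_as
same_with_plain = readdir_plain(on_disk) == opened_as
print(same_with_restoring, same_with_plain)
```

With option 1 the application sees the bytes it passed to open(); with option 2 it sees a different, longer byte string that still maps to the same file.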
You mean for "normal" cases (i.e. proper non-BMP characters, not invalid
input, specially handled characters, or private-use-range characters)?
In that case, as an application developer, I would neither expect nor
want to handle CESU-8.
Thomas



Corinna

