On 23.07.2025 at 04:25, Thomas Wolff via Cygwin wrote:


On 22.07.2025 at 17:09, Thomas Wolff via Cygwin wrote:


On 22.07.2025 at 15:05, Corinna Vinschen wrote:
On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
On Jun 26 19:07, Christian Franke via Cygwin wrote:
With some trial and error I found a testcase for this more serious problem
reported yesterday but not quoted above:

In cases like file3-... above, the converted Windows path ends with
0xF000. This suggests that this is an accidental conversion of the
terminating null to the 0xF0xx range.

In some cases, the created Windows file name has random garbage
behind the 0xF000. Then even Cygwin is not able to access or unlink
the file after creation.
Testcase (attached):
Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input.  In case of 4 byte UTF-8 sequences, the code created the
low surrogate already after reading byte 3, without checking if byte 4
of the UTF-8 sequence is a valid byte. Hilarity ensues.
I'm afraid the fix may have broken mbrtowc as I just reported to the list,
with a test case, thus also breaking mintty.
The low surrogate MUST be created after byte 3 because otherwise the high
surrogate cannot be delivered after byte 4 as it needs to.
I think it's a drawback of UTF-16 that must be swallowed, even if some
incorrect sequences slip through somehow.
Bummer.  What bugs me most is that you might be right here. It's a bit
late, but we should have defined wchar_t as a 4 byte type back when we
worked on Cygwin 1.7.0... sigh.

mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
function which only works really correctly for the Unicode base plane,
or if wchar_t is big enough.

It's the reason we don't use mbrtowc() if possible.  It's better to call
mbstowcs() or friends and allow at least 3 chars in the wchar_t buffer.
You can't change that in mintty by any chance?
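As an illustration of the suggestion above, here is a minimal sketch of calling mbstowcs() with room for three wchar_t. The helper name to_wide and the locale names tried are assumptions for this example, not anything taken from mintty or newlib; note also that the helper changes the process-wide LC_CTYPE locale.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the start of a UTF-8 string into at most
 * three wchar_t. Returns the number of wchar_t written, or (size_t)-1 if
 * the input is invalid or no UTF-8 locale could be selected.
 * Side effect: switches the global LC_CTYPE locale. */
size_t to_wide(const char *utf8, wchar_t out[3])
{
    assert(utf8 != NULL && out != NULL);

    /* The locale names are assumptions; availability varies by system. */
    if (!setlocale(LC_CTYPE, "C.UTF-8") &&
        !setlocale(LC_CTYPE, "en_US.UTF-8"))
        return (size_t)-1;

    /* Three slots leave room for a surrogate pair where wchar_t is
     * 16 bits wide, plus one further code unit. */
    return mbstowcs(out, utf8, 3);
}
```

For a single non-BMP character such as "\xF0\x9F\x98\x80", the result would be two wchar_t where wchar_t is 16 bits wide (as on Cygwin) and one where it is 32 bits wide.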
Well, I've started to think about a workaround, but it's code I've never touched before, and I'd need to think carefully through all kinds of possible special situations, so my testing effort would be high. I'd also need to implement the bytewise collection of multibyte sequences, which is currently done by that function. Since dropping mbrtowc from mintty would still leave the function itself broken (and who knows what other software may fall into that trap...), I'd prefer a fix of that function anyway.
I've checked whether I could use the old version of mbrtowc from newlib directly in mintty, but it pulls in too many dependencies. I've also checked whether I could use _mbrtowc_r instead, which is declared in wchar.h, but it does not link. By the way, the discussion and the commit log mix up the order: the high surrogate comes first.
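To illustrate the ordering: the high surrogate depends only on bits carried by bytes 1 to 3 of a 4-byte UTF-8 sequence, so it can be delivered after byte 3, while the low surrogate also needs byte 4. The following is a minimal sketch of that arithmetic only; the function names are hypothetical and this is not the newlib code.

```c
#include <assert.h>
#include <stdint.h>

/* Code point bits contributed by the first three bytes of a 4-byte
 * UTF-8 sequence; the six bits from byte 4 are left as zero. */
static uint32_t partial_cp(const unsigned char *b)
{
    return ((uint32_t)(b[0] & 0x07) << 18)
         | ((uint32_t)(b[1] & 0x3F) << 12)
         | ((uint32_t)(b[2] & 0x3F) << 6);
}

/* The high surrogate uses code point bits 10 and up, all of which are
 * known once byte 3 has arrived. */
uint16_t high_surrogate_after_byte3(const unsigned char *b)
{
    uint32_t cp = partial_cp(b);
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* The low surrogate uses code point bits 0 to 9, which include the six
 * payload bits of byte 4. */
uint16_t low_surrogate_after_byte4(const unsigned char *b)
{
    uint32_t cp = partial_cp(b) | (uint32_t)(b[3] & 0x3F);
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For U+1F600 (bytes F0 9F 98 80) this yields 0xD83D after byte 3 and 0xDE00 after byte 4. The sketch does not validate byte 4, which is exactly the gap discussed above: the high surrogate has already been handed out when an invalid fourth byte shows up.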

OK, suppose I switched to mbs[[n]r]towcs, collecting bytes until the function gives me a result. This would work fine as long as I receive only valid sequences. But look at this input string test case:

    char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence followed by a valid char

The functions only return -1 and (in the case of mbsnrtowcs) do not advance the input pointer. So how am I supposed to recognize that the invalid sequence has ended and a valid character has arrived?
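One conceivable answer, sketched here purely as an assumption and not as what mintty or newlib do, is to pre-scan the bytes by hand and only pass structurally complete sequences on to the conversion function. The names utf8_expected_len and utf8_next_chunk are hypothetical. The check is structural only: it does not reject overlong forms, encoded surrogates, or values above U+10FFFF, and it reports a sequence truncated at the end of the buffer as invalid, where a streaming caller would rather wait for more input.

```c
#include <assert.h>
#include <stddef.h>

/* Total sequence length implied by a lead byte, or 0 if the byte cannot
 * start a UTF-8 sequence (continuation byte, 0xC0, 0xC1, 0xF5..0xFF). */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Look at s[0..n) and return how many bytes form the next chunk.
 * *valid is set to 1 if the chunk is a structurally complete sequence,
 * or to 0 if it is an invalid run that the caller should discard. */
size_t utf8_next_chunk(const unsigned char *s, size_t n, int *valid)
{
    assert(s != NULL && valid != NULL && n > 0);

    size_t want = utf8_expected_len(s[0]);
    size_t i = 1;

    if (want == 0) {
        /* Invalid lead byte: drop it together with any continuation
         * bytes (10xxxxxx) that follow, then resynchronize. */
        while (i < n && (s[i] & 0xC0) == 0x80)
            i++;
        *valid = 0;
        return i;
    }

    /* Valid lead byte: collect the continuation bytes it announces. */
    while (i < want && i < n && (s[i] & 0xC0) == 0x80)
        i++;
    *valid = (i == want);
    return i;
}
```

Applied to the nonbmp test case, the first call consumes the four bytes F8 88 8A AF as one invalid run, and the next call returns the single valid byte 0x2D.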



Thomas

Corinna






--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
