On 23.07.2025 at 04:25, Thomas Wolff via Cygwin wrote:
On 22.07.2025 at 17:09, Thomas Wolff via Cygwin wrote:
On 22.07.2025 at 15:05, Corinna Vinschen wrote:
On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
On Jun 26 19:07, Christian Franke via Cygwin wrote:
With some trial and error I found a testcase for this more serious problem
reported yesterday but not quoted above:
In cases like file3-... above, the converted Windows path ends with 0xF000.
This suggests that this is an accidental conversion of the terminating null
to the 0xF0xx range.
In some cases, the created Windows file name has random garbage behind the
0xF000. Then even Cygwin is not able to access or unlink the file after
creation.
Testcase (attached):
Thanks for the testcase!
I found the problem in the newlib core function creating wchar_t from
UTF-8 input. In case of 4 byte UTF-8 sequences, the code created the
low surrogate already after reading byte 3, without checking if byte 4
of the UTF-8 sequence is a valid byte. Hilarity ensues.
I'm afraid the fix may have broken mbrtowc as I just reported to the list,
with a test case, thus also breaking mintty.
The low surrogate MUST be created after byte 3 because otherwise the high
surrogate cannot be delivered after byte 4 as it needs to.
I think it's a drawback of UTF-16 that must be swallowed, even if some
incorrect sequences slip through somehow.
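To make the constraint concrete, here is a minimal sketch of the bit arithmetic, written independently of the newlib code. The helper names are hypothetical and no validation is done. It shows that the first UTF-16 unit (the high surrogate) depends only on bytes 1 to 3 of a 4-byte UTF-8 sequence, while the second unit needs byte 4, so a byte-at-a-time converter has to hand out the first unit before it can know whether byte 4 is valid.

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from newlib or mintty. */

/* First UTF-16 unit (high surrogate, 0xD800..0xDBFF), computed from
   bytes 1..3 of a 4-byte UTF-8 sequence. Byte 4 is not needed. */
static uint16_t high_surrogate(unsigned char b0, unsigned char b1,
                               unsigned char b2)
{
    /* Bits 20..6 of the code point. */
    uint32_t top = ((uint32_t)(b0 & 0x07) << 12)
                 | ((uint32_t)(b1 & 0x3F) << 6)
                 | (uint32_t)(b2 & 0x3F);
    /* Subtracting 0x10000 from the code point equals subtracting 0x400
       here; the remaining shift by 4 keeps bits 19..10. */
    return (uint16_t)(0xD800 | ((top - 0x400) >> 4));
}

/* Second UTF-16 unit (low surrogate, 0xDC00..0xDFFF), which needs the
   low 4 bits of byte 3 and the 6 payload bits of byte 4. */
static uint16_t low_surrogate(unsigned char b2, unsigned char b3)
{
    return (uint16_t)(0xDC00 | ((uint32_t)(b2 & 0x0F) << 6)
                             | (uint32_t)(b3 & 0x3F));
}

/* A UTF-8 continuation byte has the form 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

For U+1F600 (bytes F0 9F 98 80) the two helpers give 0xD83D and 0xDE00. A check such as is_continuation() on byte 4 can only run after the first unit has already been returned to the caller.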
Bummer. What bugs me most is that you might be right here. It's a bit
late, but we should have defined wchar_t as a 4 byte type back when we
worked on Cygwin 1.7.0... sigh.
mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
function which only works really correctly for the unicode base plane,
or if wchar_t is big enough.
It's the reason we don't use mbrtowc() if possible. It's better to call
mbstowcs() or friends and allow at least 3 chars in the wchar_t buffer.
You can't change that in mintty by any chance?
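A minimal sketch of the whole-string approach, for illustration only. The wrapper name convert_all is hypothetical, and it assumes the caller has already selected a suitable locale with setlocale(). Reading "at least 3 chars" as room for a surrogate pair plus the terminator is an assumption as well.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert a complete NUL-terminated multibyte
   string in one call instead of feeding mbrtowc() byte by byte.
   Returns the number of wchar_t written, not counting the terminator,
   or (size_t)-1 if the input is invalid or the buffer has no room. */
static size_t convert_all(const char *mb, wchar_t *out, size_t outlen)
{
    if (outlen == 0)
        return (size_t)-1;
    /* Reserve one slot so the result can always be terminated. */
    size_t n = mbstowcs(out, mb, outlen - 1);
    if (n == (size_t)-1)
        return n;
    out[n] = L'\0';
    return n;
}
```

With a 16-bit wchar_t, a single non-BMP character then arrives as two units in one call, so the caller never sees a half-delivered surrogate pair.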
Well, I've started to think about a workaround, but it's code I've
never touched before and I'd need to think carefully through all kinds
of possible special situations, so my testing effort would be high.
Also, I'd need to implement bytewise mbr collection which is right
now done by that function.
Since not using mbrtowc anymore would leave it still broken (and what
other software may fall into that trap...), I'd prefer a fix of that
function anyway.
I've checked whether to use the old version of mbrtowc from newlib
directly in mintty but it pulls too many dependencies...
I've also checked whether to use _mbrtowc_r instead, which is declared
in wchar.h, but it does not link.
By the way, discussion and commit log mix up the order: the high
surrogate comes first.
OK, suppose I'd consider switching to mbs[[n]r]towcs, collecting bytes
until the function gives me a result.
This would work fine as long as I receive only valid sequences. But look
at this input string test case:
char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence followed by a valid char
The functions only return -1 and (in the case of mbsnrtowcs) do not
advance the input pointer.
So how am I supposed to recognize that the invalid sequence has ended
and a valid character has arrived?
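One possible heuristic, shown here only as a sketch and not as anything mintty or newlib implements: after a conversion error, drop the offending byte plus any UTF-8 continuation bytes (10xxxxxx) that follow it, then retry the conversion at the next byte. The helper name skip_invalid is hypothetical, and the sketch assumes the input is meant to be UTF-8.

```c
#include <stddef.h>

/* Hypothetical resync helper. Given a buffer that starts with a byte
   the converter rejected, return how many bytes to discard before
   trying again: the rejected byte itself plus all continuation bytes
   (10xxxxxx) that immediately follow it. */
static size_t skip_invalid(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;
    size_t i = 1;   /* always drop the byte that caused the error */
    while (i < len && (s[i] & 0xC0) == 0x80)
        i++;        /* drop orphaned continuation bytes as well */
    return i;
}
```

For the test string above this returns 4, so conversion would resume at the 0x2D byte. Whether the discarded run should be shown as one replacement character or several is a separate policy decision that this sketch leaves open.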
Thomas
Corinna