Re: readdir() returns inaccessible name if file was created with invalid UTF-8

Thomas Wolff via Cygwin Tue, 22 Jul 2025 20:45:36 -0700


On 23.07.2025 at 04:25, Thomas Wolff via Cygwin wrote:

On 22.07.2025 at 17:09, Thomas Wolff via Cygwin wrote:
On 22.07.2025 at 15:05, Corinna Vinschen wrote:
On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
On 27.06.2025 at 12:30, Corinna Vinschen via Cygwin wrote:
On Jun 26 19:07, Christian Franke via Cygwin wrote:
With some trial and error I found a testcase for this more serious problem
reported yesterday but not quoted above:
In cases like file3-... above, the converted Windows path ends with
0xF000. This suggests that this is an accidental conversion of the
terminating null to the 0xF0xx range.

In some cases, the created Windows file name has random garbage
behind the 0xF000. Then even Cygwin is not able to access or unlink
the file after creation.
Testcase (attached):
Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input.  In case of 4 byte UTF-8 sequences, the code created the
low surrogate already after reading byte 3, without checking if byte 4
of the UTF-8 sequence is a valid byte. Hilarity ensues.
I'm afraid the fix may have broken mbrtowc as I just reported to the list,
with a test case, thus also breaking mintty.
The low surrogate MUST be created after byte 3 because otherwise the high
surrogate cannot be delivered after byte 4 as it needs to.
I think it's a drawback of UTF-16 that must be swallowed, even if some
incorrect sequences slip through somehow.
Bummer.  What bugs me most is that you might be right here. It's a bit
late, but we should have defined wchar_t as a 4 byte type back when we
worked on Cygwin 1.7.0... sigh.

mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
function which only works really correctly for the unicode base plane,
or if wchar_t is big enough.
It's the reason we don't use mbrtowc() if possible. It's better to call
mbstowcs() or friends and allow at least 3 chars in the wchar_t buffer.
You can't change that in mintty by any chance?
Well, I've started to think about a workaround but it's code I've never
touched before and I'd need to carefully ponder all kinds of possible
special situations, so my testing effort would be high.
Also, I'd need to implement bytewise mbr collection which is right now
done by that function.
Since not using mbrtowc anymore would leave it still broken (and what
other software may fall into that trap...), I'd prefer a fix of that
function anyway.
I've checked whether to use the old version of mbrtowc from newlib
directly in mintty but it pulls too many dependencies...
I've also checked whether to use _mbrtowc_r instead which is defined
in wchar.h but it does not link.
By the way, discussion and commit log mix up the order: the high
surrogate comes first.
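
To make the ordering concrete, here is a minimal sketch of the UTF-16
arithmetic for a 4-byte UTF-8 sequence. It is not the newlib code; the
function names are made up for illustration and the input is assumed to
be well-formed. The point it shows: the high surrogate depends only on
bytes 1 to 3, while the low surrogate also needs byte 4.

```c
#include <stdint.h>

/* Hypothetical helper: high surrogate from the first three bytes of a
   4-byte UTF-8 sequence. Assumes b1 is 0xF0..0xF4 and b2, b3 are valid
   continuation bytes; no validation is done here. */
static uint16_t high_surrogate_from_3(uint8_t b1, uint8_t b2, uint8_t b3)
{
    /* Code point bits 20..6; the low 6 bits (from byte 4) are still 0. */
    uint32_t partial = ((uint32_t)(b1 & 0x07) << 18)
                     | ((uint32_t)(b2 & 0x3F) << 12)
                     | ((uint32_t)(b3 & 0x3F) << 6);
    /* The high surrogate carries the top 10 bits of (cp - 0x10000). */
    return (uint16_t)(0xD800 + ((partial - 0x10000) >> 10));
}

/* Hypothetical helper: low surrogate, which needs the low 4 bits of
   byte 3 and the 6 payload bits of byte 4. */
static uint16_t low_surrogate_from_34(uint8_t b3, uint8_t b4)
{
    return (uint16_t)(0xDC00 | ((uint16_t)(b3 & 0x0F) << 6) | (b4 & 0x3F));
}
```

For U+1F600 (bytes F0 9F 98 80) this gives 0xD83D from the first three
bytes and 0xDE00 once the fourth byte has arrived.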

OK, suppose I'd consider to switch to mbs[[n]r]towcs, collecting bytes
until the function gives me a result.
This would work fine as long as I receive only valid sequences. But look
at input string test case
char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence followed by a valid char
The functions only return -1 and (in the case of mbsnrtowcs) do not
advance the input pointer.
So how am I supposed to recognize that the invalid sequence has ended
and a valid character has arrived?
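
One conceivable way around this, sketched here only to illustrate the
problem, is to resynchronize by hand: inspect the bytes directly and skip
forward until a well-formed UTF-8 sequence starts. This is not what
mintty or newlib does; both function names are hypothetical, and the
sketch treats a truncated sequence as invalid, which a real terminal
reading a byte stream would have to handle differently.

```c
#include <stddef.h>

/* Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence
   starting at s, or 0 if the bytes there are not well-formed (or the
   sequence is cut off before n bytes). */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned char b;

    if (n == 0)
        return 0;
    b = s[0];
    if (b < 0x80)
        return 1;
    else if (b >= 0xC2 && b <= 0xDF)
        len = 2;
    else if (b >= 0xE0 && b <= 0xEF)
        len = 3;
    else if (b >= 0xF0 && b <= 0xF4)
        len = 4;
    else
        return 0;   /* stray continuation byte or invalid lead, e.g. 0xF8 */

    if (n < len)
        return 0;   /* truncated; counted as invalid in this sketch */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;

    /* Reject overlong forms, UTF-16 surrogates and values > U+10FFFF. */
    if (b == 0xE0 && s[1] < 0xA0) return 0;
    if (b == 0xED && s[1] > 0x9F) return 0;
    if (b == 0xF0 && s[1] < 0x90) return 0;
    if (b == 0xF4 && s[1] > 0x8F) return 0;
    return len;
}

/* Hypothetical helper: number of bytes to skip before the next
   well-formed sequence begins (n if there is none). */
static size_t utf8_resync(const unsigned char *s, size_t n)
{
    size_t skipped = 0;

    while (skipped < n && utf8_seq_len(s + skipped, n - skipped) == 0)
        skipped++;
    return skipped;
}
```

For the test case above (without the terminating 0), utf8_resync()
skips the four bytes 0xF8 0x88 0x8A 0xAF and stops at 0x2D, so
conversion could be restarted from there with a fresh mbstate_t.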


Thomas

Corinna



--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
