On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
OK, suppose I considered switching to mbs[[n]r]towcs, collecting bytes until
the function gives me a result.
This would work fine as long as I receive only valid sequences. But look at
this input string test case:
char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence followed by a valid char
The functions only return -1 and (in the case of mbsnrtowcs) do not advance
the input pointer.
So how am I supposed to recognize that the invalid sequence has ended and a
valid character has arrived?
Yeah, I see the problem. One of the slightly puzzling behaviours
of mbsnrtowcs is the fact that the src pointer stays at the start of
the invalid sequence. I think the idea is to skip the invalid sequence
byte-wise until mbsnrtowcs reports a valid sequence again.
But an incomplete sequence could still be completed to become a valid
sequence...
So I could limit the number of bytes I collect, say, to a maximum run of
bytes with the high bit set. Not sure whether that works for other
multibyte encodings, certainly not for GB18030.
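
To make the byte-wise skipping idea concrete, here is a minimal sketch using
only standard mbrtowc. The helper name decode_resync and its parameters are
made up for illustration and are not part of Cygwin or newlib. It assumes the
whole buffer is already available, drops one byte after every invalid-sequence
error, and stops at an incomplete tail instead of guessing. It does not address
the surrogate handling of a 16-bit wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode len bytes from src in the current locale.
   On an invalid sequence, skip exactly one byte and try again.
   Returns the number of wide characters stored in out;
   *skipped receives the number of bytes dropped. */
static size_t decode_resync(const char *src, size_t len,
                            wchar_t *out, size_t outmax, size_t *skipped)
{
    mbstate_t st;
    size_t i = 0, nout = 0;

    assert(src != NULL && out != NULL && skipped != NULL);
    memset(&st, 0, sizeof st);
    *skipped = 0;

    while (i < len && nout < outmax) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, src + i, len - i, &st);

        if (r == (size_t)-2)
            break;                      /* incomplete tail: needs more input */
        if (r == (size_t)-1) {
            /* State is unspecified after an error, so reset it. */
            memset(&st, 0, sizeof st);
            i++;                        /* drop one byte and resynchronize */
            (*skipped)++;
            continue;
        }
        out[nout++] = wc;
        i += (r == 0) ? 1 : r;          /* a decoded NUL consumes one byte */
    }
    return nout;
}
```

For the test case above, the trailing 0x2D should come out as the last decoded
character. For a stream, the open question remains the (size_t)-2 case: the
sketch just stops there, which is exactly where a byte limit would be needed.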
What bugs me is that we have the choice between a broken mbrtowc on
one side and a chance to generate broken filenames on the other side.
I did not look into those details, but while characters to be handled by
a terminal arrive sequentially as a stream, a filename is available as a
complete string. Isn't that easier to check?
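
As a rough illustration of checking a complete string in one go, the sketch
below uses standard mbsrtowcs with a null destination, which only counts and
validates. The helper name mbs_is_valid is made up, and this is not how the
Cygwin filename conversion functions work; it only shows that no incremental
state has to be kept when the whole string is at hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the whole NUL-terminated string s
   converts to wide characters in the current locale, 0 otherwise. */
static int mbs_is_valid(const char *s)
{
    mbstate_t st;
    const char *p = s;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    /* With a null destination, mbsrtowcs stores nothing and returns the
       character count, or (size_t)-1 on an invalid sequence. */
    return mbsrtowcs(NULL, &p, 0, &st) != (size_t)-1;
}
```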
I think we should actually revert fa272e05bbd0 ("wcstombs: also call
__WCTOMB on terminating NUL if output buffer is NULL") and see if we can
fix the filename issue in the Cygwin functions for filename conversion
alone.
Any ideas appreciated.
Corinna