On Jul 28 12:37, Andy Koppe wrote:
> 2009/7/28 Corinna Vinschen:
> >> Trouble is, the hack will also only work correctly if the whole UTF-8
> >> sequence for the non-BMP character is passed at once. If you pass the
> >> bytes one-by-one instead, and assuming the bug above wasn't there,
> >> you'd get this:
> >
> > Yes, I know. The real trouble is, I don't know how that can be fixed
> > in a still sort-of-POSIXy way.
>
> The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
> because apps that check the mbrtowc() return code (and not the written
> wc) against zero will interpret a low surrogate as string end. An
> alternative might be to just return an error when there's no compliant
> way to return the low surrogate. Do you think either of these are
> worth pursuing?

I'm thinking of faking a valid return of 1 (or 2, or 3) after the third
byte has been read. Three bytes are sufficient to create the first
surrogate half in wc. Upon reading the last byte, return 1 and set wc
to the second surrogate half.

> Therefore I think long-term Cygwin's wchar_t will need to change to 32
> bits for Linux compatibility. Of course that would require major,
> ABI-breaking changes:

That's really not an option for now.

> - Introduce a separate type for representing UTF-16, e.g. "vchar_t",
>   because 'v' is half a 'w' ;)

There's a proposal to the C standard to add specific Unicode types
along the lines of c_utf[8,16,32]_t or something like that. That
doesn't help us in the least, unfortunately. For now we have to live
with the wrong decision to make wchar_t 16 bits wide on the Win32
platform.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
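
PS: To illustrate why three bytes are enough for the first surrogate
half, here is a rough sketch of the bit arithmetic. It is not the
actual mbrtowc() code; the helper names high_surrogate() and
low_surrogate() are made up for this example, and it assumes the input
is already known to be a well-formed 4-byte UTF-8 sequence. The high
surrogate needs only the top 11 bits of the code point, which are all
present after byte three. The low surrogate needs the low four bits of
byte three again, so those would have to be kept in the conversion
state between calls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: first surrogate half from the first three bytes
   of a 4-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). */
static uint16_t high_surrogate(uint8_t b0, uint8_t b1, uint8_t b2)
{
    /* Only a lead byte of a 4-byte sequence is accepted here. */
    assert((b0 & 0xf8) == 0xf0);

    /* Top 11 bits of the code point: 3 from b0, 6 from b1, 2 from b2. */
    uint32_t top = ((uint32_t)(b0 & 0x07) << 8)
                 | ((uint32_t)(b1 & 0x3f) << 2)
                 | ((uint32_t)(b2 & 0x30) >> 4);

    /* Subtracting 0x10000 from the code point equals subtracting 0x40
       from its top 11 bits, since the low 10 bits are unaffected. */
    return (uint16_t)(0xd800 + (top - 0x40));
}

/* Hypothetical helper: second surrogate half. It needs the low 4 bits
   of the third byte plus the 6 payload bits of the fourth byte. */
static uint16_t low_surrogate(uint8_t b2, uint8_t b3)
{
    return (uint16_t)(0xdc00
                      | ((uint32_t)(b2 & 0x0f) << 6)
                      | (uint32_t)(b3 & 0x3f));
}
```

For example, U+1F600 is F0 9F 98 80 in UTF-8, and
high_surrogate(0xf0, 0x9f, 0x98) together with
low_surrogate(0x98, 0x80) gives the UTF-16 pair D83D DE00.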