On Nov 24 11:01, Brian Inglis via Cygwin wrote:
> On 2021-11-24 02:25, Corinna Vinschen via Cygwin wrote:
>
> > On Tue, Nov 23, 2021 at 11:18:25AM -0700, Brian Inglis wrote:
> > > > Do Cygwin and/or Windows support surrogate pairs in UTF-8?
> >
> > You mean UTF-16.  UTF-8 doesn't know surrogate pairs, UTF-16 does.
> > Originally there was UCS-2, 16 bits, with only 65536 code points.
> > However, Unicode left the BMP already with version 2.0 in 1996, so
> > UTF-16 and surrogate pairs became necessary.  Windows as well as Cygwin
> > support them.
>
> How does Cygwin support UTF-16 locales with surrogate pairs?

UTF-16 locales?  There's no such thing.  UTF-16 is just the 16-bit
representation of Unicode and, as such, is independent of the locale.
On the user side, Cygwin only supports UTF-8 as the Unicode
representation.  Internally you can then convert UTF-8 strings to
wchar_t, which is UTF-16.

> Are they the "native" locales inherited from Windows if others are not
> specified e.g. UTF-8, some OEM SBCS or MBCS?

Just try `locale -av' and you'll see all supported locales and their
respective default codesets.  All of them can be used with the .utf8
specifier to use UTF-8 instead of the default codeset.  Some of them
use UTF-8 as the default codeset anyway, e.g., fa_IR or yo_NG.

> > > There are 3 tests in surrogate-pair and only the 3rd one failed. So I guess
> > > surrogate pairs in UTF-8 "mostly work".
> >
> > UTF-16.  The surrogate stuff is evil at times.  Have a look at the
> > __utf8_wctomb function in
> > https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdlib/wctomb_r.c
> > Lone surrogate halves in an input stream are a problem, for instance.
>
> Thus the confusion with grep surrogate pair tests which appear to be running
> under a UTF-8 locale: see attached surrogate pair extract from cygport
> --debug grep.cygport check.

An STC in plain C might be helpful.


Corinna

--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
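
To make the UTF-8 vs. UTF-16 distinction discussed above concrete, here is
a minimal sketch in plain C.  It is not the newlib __utf8_wctomb code and
not a test case for the grep failure.  The function names utf16_encode and
utf8_encode are hypothetical, chosen only for this illustration.  It
assumes the standard encoding rules: a code point above U+FFFF becomes a
surrogate pair in UTF-16 but a single 4-byte sequence in UTF-8, and the
surrogate range U+D800..U+DFFF is not a valid scalar value in either form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if cp is a Unicode scalar value, i.e. within
   U+0000..U+10FFFF and not in the surrogate range U+D800..U+DFFF. */
static int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encode one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   not a scalar value (e.g. a lone surrogate half). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (!is_scalar_value(cp))
        return 0;
    if (cp < 0x10000) {              /* BMP: one unit, no surrogates */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}

/* Encode one code point as UTF-8.
   Returns the number of bytes written (1..4), or 0 if cp is not a
   scalar value.  No surrogates are involved at any point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (!is_scalar_value(cp))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For U+1F600, utf16_encode yields the two units D83D DE00, while
utf8_encode yields the four bytes F0 9F 98 80.  Passing a lone surrogate
such as U+D800 makes both functions return 0.  That is one possible
policy for this sketch.  How a real converter treats lone halves in an
input stream is an implementation decision.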