Re: raise(-1) has stopped returning an error recently

Brian Inglis Fri, 26 Nov 2021 23:24:49 -0800

On 2021-11-25 05:54, Corinna Vinschen via Cygwin wrote:

On Nov 24 11:01, Brian Inglis via Cygwin wrote:

On 2021-11-24 02:25, Corinna Vinschen via Cygwin wrote:

On Tue, Nov 23, 2021 at 11:18:25AM -0700, Brian Inglis wrote:

Do Cygwin and/or Windows support surrogate pairs in UTF-8?


You mean UTF-16.  UTF-8 doesn't know surrogate pairs, UTF-16 does.
Originally there was UCS-2, 16 bits, with only 65536 code points.
However, Unicode left the BMP already with version 2.0 in 1996, so
UTF-16 and surrogate pairs became necessary.  Windows as well as Cygwin
support them.
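
As a rough illustration of the pair arithmetic described above (a sketch only, not code taken from Cygwin or newlib; the function name to_surrogates is made up for this example), a code point above the BMP is reduced by 0x10000 and the remaining 20 bits are split into two 10-bit halves:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point
   (U+10000..U+10FFFF) into a UTF-16 surrogate pair. */
void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;               /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (v >> 10));    /* top 10 bits: high surrogate */
    *lo = (uint16_t)(0xDC00 | (v & 0x3FF));  /* bottom 10 bits: low surrogate */
}
```

Calling to_surrogates(0x10405, &hi, &lo) should give the d801 dc05 pair that the iconv output later in this message shows for U+10405.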


How does Cygwin support UTF-16 locales with surrogate pairs?


UTF-16 locales?  There's no such thing.  UTF-16 is just the 16 bit
representation for Unicode, and as such, is independent of the locale.
On the user side, Cygwin only supports UTF-8 as Unicode representation.
Internally you can then convert them to wchar_t which is UTF-16.

Are they the "native" locales inherited from Windows if others are not
specified e.g. UTF-8, some OEM SBCS or MBCS?


Just try `locale -av' and you'll see all supported locales and their
respective default codeset.  All of them can be used with .utf8
specifier to use UTF-8 instead of the default codeset.  Some of them
use UTF-8 as default codeset anyway, e.g., fa_IR or yo_NG.

There are 3 tests in surrogate-pair and only the 3rd one failed. So I guess
surrogate pairs in UTF-8 "mostly work".


UTF-16.  The surrogate stuff is evil at times.  Have a look at the
__utf8_wctomb function in
https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdlib/wctomb_r.c
Lone surrogate halves in an input stream are a problem, for instance.
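
To show what a "lone" half means in practice, the following sketch scans a buffer of UTF-16 code units and reports the first surrogate that has no partner. It is an illustration only and is unrelated to the actual __utf8_wctomb logic; the function name first_lone_surrogate is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the index of the first unpaired
   surrogate in s[0..n-1], or -1 if every surrogate is paired. */
long first_lone_surrogate(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
            if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;                                 /* well-formed pair: skip the low half */
            else
                return (long)i;                      /* high half with no low half after it */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (long)i;                          /* low half with no high half before it */
        }
    }
    return -1;
}
```

A converter that reads its input in chunks can hit the same situation when a chunk boundary falls between the two halves of an otherwise valid pair, which is one reason state has to be carried between calls.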


Thus the confusion with grep surrogate pair tests which appear to be running
under a UTF-8 locale: see attached surrogate pair extract from cygport
--debug grep.cygport check.


An STC in plain C might be helpful.

I think I might finally have got the point of the test, not knowing much about legacy UTF-16 UCS encoding or surrogate pairs.


From what I can see:

𐐅  U+010405  f0 90 90 85  DESERET CAPITAL LETTER LONG OO

fails to match itself; presumably others do too.

Presumably this is converted internally on some platforms, including Cygwin, to a UTF-16 surrogate pair, and a grep comparison fails, although a bash comparison succeeds.


$ printf '\U10405\n' | iconv -f utf-8 -t utf-16be | xxd -g2
00000000: d801 dc05 000a
$ printf '\U10405\n' > t
$ grep -f t t; echo $?
1
$ oo=`printf '\U10405\n'`; [ $oo = $oo ] && echo same || echo diff
same

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
