On 3/7/21 7:21 PM, Ryan Joseph via fpc-pascal wrote:

On Mar 7, 2021, at 10:11 AM, Marco van de Voort via fpc-pascal 
<fpc-pascal@lists.freepascal.org> wrote:


Yes it is. And there are 1,114,112 Unicode code points (U+0000 through U+10FFFF), or 17 times 
what fits in a 2-byte wide char.

https://en.wikipedia.org/wiki/Code_point

https://en.wikipedia.org/wiki/UTF-16
I thought unicode strings "just worked" but maybe that's UTF-8 and the 
character I want is maybe UTF-16. What are you supposed to do then? UnicodeString knows 
how to print the full string so all the data is there but I can't index to get characters 
unless I know their size.

It depends on what you mean by "just working". UnicodeString is a UTF-16 encoded string, and a WideChar is a single UTF-16 code unit. Both UTF-8 and UTF-16 are variable-length encodings; UTF-16 is just simpler to decode. Note also that, even though a single Unicode code point might need two UTF-16 code units (i.e. two WideChars), a single code point is still not enough to represent what users perceive as a character, because there are also plenty of Unicode combining characters. What most users perceive as a character is called an Extended Grapheme Cluster, and it is a sequence of one or more Unicode code points. FPC trunk provides an enumerator that splits a string into grapheme clusters, in the GraphemeBreakProperty unit. It implements this algorithm:

http://www.unicode.org/reports/tr29/
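
To make the three levels concrete (code units, code points, user-perceived characters), here is a minimal sketch. It is written in Python purely for illustration and is not the FPC API. The helper names utf16_units and decode_surrogate_pair are made up for this example. It does not implement the TR29 algorithm itself; it only shows why indexing by WideChar, or even by code point, does not give you "characters".

```python
import unicodedata

def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s (hypothetical helper; each unit
    corresponds to what a WideChar would hold)."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

emoji = "\U0001F600"   # one code point outside the Basic Multilingual Plane
accented = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, shown as one character

units = utf16_units(emoji)
print(len(emoji), len(units))                # code points vs. UTF-16 code units
print(hex(decode_surrogate_pair(*units)))    # the pair decodes back to the code point
print(len(accented), [unicodedata.name(c) for c in accented])
```

The emoji is one code point but two code units, while the accented letter is two code points (and two code units) that a user sees as a single character. Only grapheme cluster segmentation handles the second case.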

I did this for the Unicode Free Vision port in the unicodekvm SVN branch, but it has already been committed to trunk (the rest of the Unicode Free Vision port hasn't been yet), because it's a new, relatively self-contained unit that provides functionality the RTL didn't offer before, so it won't break existing code.

Note that normally, most programs wouldn't actually need to split a string into grapheme clusters, unless they implement something like a UI toolkit or a text editor or something of that sort. That's why it was needed for the Unicode Free Vision.

Nikolay

