On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote:

Right now I've just read the file into an AnsiString and indexing assuming a 
fixed character size, which breaks of course if non-1 byte characters exist

  I also need to know if I come across something like \u1F496 I need to convert 
that to a unicode character.


Hello,

You are mixing up several concepts: ASCII, Unicode, graphemes, representation, content, and so on.

When working with Unicode you must forget about ASCII. The text is stored as a sequence of bytes in a particular encoding (UTF-8, UTF-16, UTF-32, ...) and must be displayed on screen following Unicode's rendering rules, which are not the same as ASCII's.

To keep this message short, consider a text with only one "letter":

"á"

This text (text, not one letter; Unicode is about texts) can be transmitted or stored using one of the Unicode encodings, each of which has its own rules for turning the information into a sequence of bytes. Each byte below is written in hexadecimal:

UTF8: C3 A1
UTF16BE: 00 E1
UTF32BE: 00 00 00 E1
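
As a rough illustration of the byte dumps above, here is a minimal sketch in Python (chosen only because it is easy to run; the helper name `hex_bytes` is made up for this example). It encodes the one-letter text "á" in several encodings and prints the bytes, which also shows how the byte order changes between the little-endian and big-endian variants:

```python
# The text "á" as a single precomposed code point, U+00E1.
text = "\u00e1"

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'C3 A1'."""
    return " ".join(f"{b:02X}" for b in data)

# Codec names with an explicit byte order never emit a BOM.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
table = {name: hex_bytes(text.encode(name)) for name in encodings}

for name, dump in table.items():
    print(f"{name:10} {dump}")
```

Note that the little-endian forms put the low byte first (E1 00), while the big-endian forms put it last (00 E1).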

You must know the encoding in advance to recover the text from the byte sequence. There is also a BOM (Byte Order Mark), which is sometimes placed at the start of a file as a header to indicate the encoding, but you cannot rely on it being present.
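
A minimal sketch of BOM sniffing, assuming Python and its standard `codecs` constants; the function names `sniff_bom` and `decode_with_bom` and the UTF-8 fallback are assumptions made for this example, not an established convention. When no BOM is found the code simply falls back to a default, which is exactly the "you must know in advance" case:

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the longer marks are checked first.
# Caveat: UTF-16-LE text whose first character is U+0000 would be
# misread as UTF-32-LE; a BOM is a hint, not a proof.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding name, payload without BOM)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    # No BOM: the caller's assumed default is all we have.
    return default, data

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    name, payload = sniff_bom(data, default)
    return payload.decode(name)
```

For example, `decode_with_bom(b"\xff\xfe\xe1\x00")` yields the text "á" because the first two bytes identify the data as little-endian UTF-16.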

Decoding that sequence of bytes with the right encoding gives you a text that represents the letter "a" with an acute accent. But Unicode is *not* that *simple*: the same text can also be represented as the letter "a" followed by a "combining acute accent", and then the byte sequence is totally different. The two forms differ at the encoding level but are identical at the rendering level. So these two UTF8 sequences:

"C3 A1" and "61 CC 81"

are different at the code point level and the encoding level but identical at the rendering level: both display as the same single grapheme.
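
A small sketch of the same point, assuming Python and its standard `unicodedata` module. The two byte sequences decode to different strings, and Unicode normalization (NFC composes, NFD decomposes) is the usual way to bring them to a common form before comparing:

```python
import unicodedata

# The two UTF-8 sequences from the message above.
precomposed = bytes.fromhex("C3 A1").decode("utf-8")     # U+00E1
decomposed = bytes.fromhex("61 CC 81").decode("utf-8")   # U+0061 U+0301

# Different code point sequences, so a plain comparison fails.
same_raw = precomposed == decomposed

# After normalizing both to the same form they compare equal.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(len(precomposed), len(decomposed), same_raw, same_nfc)
```

Any code that indexes or compares text by code point has to decide which form it expects, or normalize on input.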

As a final note, this is the UTF-8 byte sequence for one single "character" on screen:

F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF
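
To see what is inside that sequence, here is a hedged sketch in Python that decodes the bytes and lists the code points. The variable names are invented for this example. The sequence turns out to be a flag emoji followed by invisible "tag" characters (U+E0020 to U+E007E mirror ASCII) and a terminating cancel tag:

```python
import unicodedata

raw = bytes.fromhex(
    "F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 "
    "F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF"
)
text = raw.decode("utf-8")

# One entry per code point, written in the usual U+XXXX notation.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Tag characters map back to ASCII by subtracting 0xE0000.
tag_letters = "".join(
    chr(ord(ch) - 0xE0000) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
)

for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
print(len(raw), "bytes,", len(text), "code points, tags:", tag_letters)
```

So 28 bytes hold 7 code points that a renderer supporting the sequence draws as one glyph; the tag letters spell the region code for the flag of Scotland.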

Unicode is far, far from easy.

Have a nice day.
_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
