On 01.12.2024 at 14:37, Hairy Pixels via fpc-pascal wrote:
On Dec 1, 2024 at 2:23:08 PM, Nikolay Nikolov via fpc-pascal
<fpc-pascal@lists.freepascal.org> wrote:
Here's how Free Pascal types map to Unicode terminology:
WideChar = UTF-16 code unit
UnicodeString = UTF-16 encoded string
WideString = UTF-16 encoded string. On Windows it's not reference
counted - used for COM compatibility. On other platforms, it's the same
as UnicodeString.
UTF8String = UTF-8 encoded string. Defined as UTF8String=type
AnsiString(CP_UTF8).
UTF16String = alias for UnicodeString
Hope this clears things up.
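To make the mapping above concrete, here is a minimal sketch that prints the size of each character type and the code page attached to a UTF8String. The program name and variable name are made up for illustration; the calls used (SizeOf, StringCodePage) are from the System unit.

```pascal
program typesizes;
{$mode objfpc}{$H+}
var
  u8: UTF8String;  // hypothetical variable, for illustration only
begin
  // One byte: a single code unit of an 8-bit encoding
  Writeln('AnsiChar: ', SizeOf(AnsiChar));
  // Two bytes: one UTF-16 code unit, not necessarily a whole code point
  Writeln('WideChar: ', SizeOf(WideChar));
  // Four bytes: wide enough for any Unicode code point
  Writeln('UCS4Char: ', SizeOf(UCS4Char));
  // UTF8String is AnsiString(CP_UTF8), so a non-empty value
  // should report CP_UTF8 (65001) as its code page
  u8 := 'abc';
  Writeln('UTF8String code page: ', StringCodePage(u8));
end.
```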
Another thing:
For conversions between different encodings to work (e.g. between UTF-8
and UTF-16), you need to install a widestring manager. Some platforms
(like Windows) always include one by default, but other platforms (e.g.
Linux) don't, in order to reduce bloat, for programs that don't need it.
For these, you may need to include unit cwstring or something like that.
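As a rough sketch of what that looks like in practice, the program below pulls in cwstring only on Unix-like targets and then relies on the implicit UTF-8 to UTF-16 conversion. The program and variable names are made up for illustration, and I have not checked what happens on every target when the unit is left out.

```pascal
program convdemo;
{$mode objfpc}{$H+}
{$codepage utf8}
uses
  // Installs the C-library based widestring manager on Unix-like
  // systems; Windows has one built in, so the unit is skipped there.
  {$ifdef unix}cwstring,{$endif}
  SysUtils;
var
  u8: UTF8String;
  u16: UnicodeString;
begin
  u8 := 'Grüße';
  // Implicit conversion between encodings; this is the step that
  // goes through the installed widestring manager.
  u16 := u8;
  Writeln('UTF-8 code units:  ', Length(u8));
  Writeln('UTF-16 code units: ', Length(u16));
end.
```

If the two lengths come out equal for a string with non-ASCII characters, that is a hint that no real conversion took place.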
Including that unit is sneaky; it seems you need it any time you deal
with Unicode. I'm not sure how it even knows to change the meaning of
those character constants.
There is nothing sneaky about this. This is simply how things work in
FPC to avoid linking against the C-library (or including quite a load of
Unicode data in case of fpwidestring instead of cwstring) when for much
code it isn't necessary (just like the need to use unit cthreads on
*nix-systems to install the threading manager).
Using the term “char” was maybe a mistake. This misleads people into
thinking it’s a “character” as they perceive it but really it’s just a
code point.
There isn't much choice here, because that type name exists from old
Pascal times and that will not change (well, okay, it will change
insofar as Char = WideChar when the Unicode RTL is enabled, instead of
Char = AnsiChar as it is now).
Why isn’t there a “UnicodeChar” type which is 4 bytes and holds a full
UTF-8 character?
There is, it's called UCS4Char. Also it's not a "full UTF-8 character",
but simply a "Unicode code point".
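For illustration, a small sketch of getting UCS4Char values out of a UnicodeString with the RTL's UnicodeStringToUCS4String. The program and variable names are made up, and the assumption that the resulting UCS4String carries a trailing zero terminator is from memory of the RTL, so please verify it against your FPC version.

```pascal
program ucs4demo;
{$mode objfpc}{$H+}
{$codepage utf8}
var
  s: UnicodeString;
  u: UCS4String;   // dynamic array of UCS4Char
  i: SizeInt;
begin
  s := 'A🌎';
  // Converts UTF-16 code units to one UCS4Char per code point
  u := UnicodeStringToUCS4String(s);
  // Assumption: the last element is a zero terminator, so stop
  // one element early.
  for i := 0 to Length(u) - 2 do
    Writeln(HexStr(Ord(u[i]), 8));
end.
```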
Choosing UTF-16 for UnicodeString was probably a mistake too.
Take that up with Borland, because they named it "UnicodeString". That
was mainly because they only had to deal with Windows compatibility,
where there were either the single-byte encodings or the UTF-16 encoding.
It’s my understanding all websites are UTF-8 which means this encoding
will dominate everything. I think UTF-8 is by far the most used right?
UTF-8 is usually used for storing and exchanging text, because it is the
most compact Unicode encoding for mostly-ASCII content. However, many
languages or runtimes, including JavaScript, Java's JVM, the .NET CLR,
Windows, Qt, UEFI and Delphi >= 2009, use UTF-16 internally.
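A consequence of UTF-16 is that Length and the built-in for-in loop work on code units, not code points. The sketch below (program and variable names are made up) iterates over a string holding a single character outside the Basic Multilingual Plane. In UTF-16, U+1F30E is encoded as the surrogate pair D83C DF0E, so the loop should visit two WideChar values.

```pascal
program codeunits;
{$mode objfpc}{$H+}
{$codepage utf8}
var
  s: UnicodeString;
  c: WideChar;
begin
  s := '🌎';  // U+1F30E, one code point
  // Length counts UTF-16 code units, not code points
  Writeln('Length: ', Length(s));
  // The built-in enumerator yields one WideChar (code unit) at a time
  for c in s do
    Writeln(HexStr(Ord(c), 4));
end.
```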
As a user I would expect that taking a string constant and assigning it
to a UnicodeString would let me iterate over UnicodeChar. That’s logical,
right? Maybe this is just left undone as of now. I don’t know.
var
u: UnicodeChar;
s: UnicodeString;
begin
s := 'Hello, 🌎!';
for u in s do
writeln(u);
Here you go:
=== code begin ===
program tstrenum;

{$codepage utf8}
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}

type
  { Enumerates a UTF-16 string one Unicode code point at a time }
  TUCS4CharUnicodeStrEnumerator = record
  private
    fStr: UnicodeString;
    fIndex: SizeInt;
    fCurrent: UCS4Char;
  public
    constructor Create(const aStr: UnicodeString);
    function MoveNext: Boolean;
    property Current: UCS4Char read fCurrent;
  end;

constructor TUCS4CharUnicodeStrEnumerator.Create(const aStr: UnicodeString);
begin
  fStr := aStr;
  { strings are 1-based, so start just before the first code unit }
  fIndex := 0;
  fCurrent := 0;
end;

function TUCS4CharUnicodeStrEnumerator.MoveNext: Boolean;
begin
  Inc(fIndex);
  if fIndex > Length(fStr) then
    Exit(False);
  { high surrogate? }
  if (Ord(fStr[fIndex]) >= $D800) and (Ord(fStr[fIndex]) <= $DBFF) then
  begin
    if fIndex < Length(fStr) then begin
      { followed by a low surrogate: combine the pair into one code point }
      if (Ord(fStr[fIndex + 1]) >= $DC00) and (Ord(fStr[fIndex + 1]) <= $DFFF) then begin
        fCurrent := (UCS4Char(Ord(fStr[fIndex]) - $D800) shl 10) +
          UCS4Char(Ord(fStr[fIndex + 1]) - $DC00) + $10000;
        Inc(fIndex);
      end else
        { unpaired high surrogate: pass it through unchanged }
        fCurrent := Ord(fStr[fIndex]);
    end else
      { high surrogate at the end of the string: pass it through }
      fCurrent := Ord(fStr[fIndex]);
  end else
    fCurrent := Ord(fStr[fIndex]);
  Result := True;
end;

operator Enumerator(const aStr: UnicodeString): TUCS4CharUnicodeStrEnumerator;
begin
  Result := TUCS4CharUnicodeStrEnumerator.Create(aStr);
end;

var
  s: UnicodeString;
  u: UCS4Char;
begin
  s := 'Hello, 🌎!';
  for u in s do
    Writeln(HexStr(Ord(u), 8));
end.
=== code end ===
Regards,
Sven