On 01.12.2024 at 14:37, Hairy Pixels via fpc-pascal wrote:
On Dec 1, 2024 at 2:23:08 PM, Nikolay Nikolov via fpc-pascal <fpc-pascal@lists.freepascal.org> wrote:
Here's how Free Pascal types map to Unicode terminology:

WideChar = UTF-16 code unit

UnicodeString = UTF-16 encoded string

WideString = UTF-16 encoded string. On Windows it's not reference
counted - used for COM compatibility. On other platforms, it's the same
as UnicodeString.

UTF8String = UTF-8 encoded string. Defined as UTF8String=type
AnsiString(CP_UTF8).

UTF16String = alias for UnicodeString

Hope this clears things up.
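
To make the mapping above concrete, here is a minimal sketch (the program name and variable names are made up for illustration). It assumes the source file is saved as UTF-8 and relies on Length counting code units rather than characters:

```pascal
program typesizes;

{$codepage utf8}
{$mode objfpc}{$H+}

var
  u: UnicodeString;
  s8: UTF8String;
begin
  // One WideChar is one UTF-16 code unit, i.e. 2 bytes.
  Writeln('SizeOf(WideChar) = ', SizeOf(WideChar));

  u := 'Hello, 🌎!';
  // Length counts UTF-16 code units: 9 BMP characters plus a
  // surrogate pair for the globe, so 10 is expected here.
  Writeln('UTF-16 code units: ', Length(u));

  // UTF8Encode is used so the example does not rely on an
  // implicit, widestring-manager-based conversion.
  s8 := UTF8Encode(u);
  // Length counts bytes: 8 ASCII bytes plus 4 for the globe, so 12.
  Writeln('UTF-8 code units:  ', Length(s8));
end.
```

Neither count equals the nine characters a reader would see, which is why Length on these types should be read as "number of code units".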


Another thing:

For conversions between different encodings to work (e.g. between UTF-8
and UTF-16), you need to install a widestring manager. Some platforms
(like Windows) always include one by default, but other platforms (e.g.
Linux) don't, in order to reduce bloat for programs that don't need it.
For these, you may need to include the unit cwstring (or fpwidestring).
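
As a rough sketch of what that looks like in practice (program and variable names are illustrative, and the source file is assumed to be UTF-8), cwstring is pulled in only on Unix-like targets, and the assignment then triggers an implicit UTF-16 to UTF-8 conversion:

```pascal
program convdemo;

{$codepage utf8}
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;  // installs a widestring manager backed by the C library
{$endif}

var
  u: UnicodeString;
  s: UTF8String;
begin
  u := 'Hello, 🌎!';
  // Implicit conversion between encodings; this is the step that
  // depends on an installed widestring manager.
  s := u;
  Writeln('UTF-16 code units: ', Length(u));
  Writeln('UTF-8 bytes:       ', Length(s));
end.
```

Listing cwstring early in the uses clause is the usual advice, so the manager is installed before other units start converting strings.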

Including that unit is sneaky; it seems you need it any time you deal with Unicode. I'm not sure how it even knows to change the meaning of those character constants.

There is nothing sneaky about this. This is simply how things work in FPC to avoid linking against the C-library (or including quite a load of Unicode data in case of fpwidestring instead of cwstring) when for much code it isn't necessary (just like the need to use unit cthreads on *nix-systems to install the threading manager).

Using the term “char” was maybe a mistake. It misleads people into thinking it’s a “character” as they perceive it, but really it’s just a code unit.

There isn't much choice here, because that type name exists from old Pascal times and that will not change (well, okay, it will change insofar as, when the Unicode RTL is enabled, it will be Char = WideChar instead of Char = AnsiChar as it is now).

Why isn’t there a “UnicodeChar” type which is 4 bytes and holds a full UTF-8 character?

There is, it's called UCS4Char. Also it's not a "full UTF-8 character", but simply a "Unicode code point".
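
A small sketch of UCS4Char in use, assuming a UTF-8 source file; the program and variable names are illustrative. It uses the RTL function UnicodeStringToUCS4String to turn UTF-16 code units into code points:

```pascal
program ucs4demo;

{$codepage utf8}
{$mode objfpc}{$H+}

var
  u: UnicodeString;
  c: UCS4String;
  i: SizeInt;
begin
  Writeln('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));

  u := 'A🌎';
  c := UnicodeStringToUCS4String(u);
  for i := 0 to High(c) do
    // The converted array is expected to carry a trailing 0
    // terminator; skip it instead of relying on its position.
    if c[i] <> 0 then
      Writeln(HexStr(Ord(c[i]), 8));
end.
```

Each printed value is one Unicode code point, so the globe shows up as a single entry rather than as two surrogates.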

Choosing UTF-16 for UnicodeString was probably a mistake too.

Take that up with Borland, because they named it "UnicodeString". That is mainly because they only had to deal with Windows compatibility, where there were either single-byte encodings or UTF-16.

It’s my understanding that all websites are UTF-8, which means this encoding will dominate everything. I think UTF-8 is by far the most used, right?

UTF-8 is usually used for storing and exchanging text, because it is typically the most compact Unicode encoding (at least for mostly ASCII text). However, many languages and runtimes, including JavaScript, Java's JVM, the .NET CLR, Windows, Qt, UEFI and Delphi >= 2009, use UTF-16 internally.

As a user I would expect that taking a string constant and assigning it to a UnicodeString would let me iterate over UnicodeChar values. That’s logical, right? Maybe this has just been left undone for now. I don’t know.

var
  u: UnicodeChar;
  s: UnicodeString;
begin
  s := 'Hello, 🌎!';
  for u in s do
    writeln(u);

Here you go:

=== code begin ===

program tstrenum;

{$codepage utf8}
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}

type
  TUCS4CharUnicodeStrEnumerator = record
  private
    fStr: UnicodeString;
    fIndex: SizeInt;
    fCurrent: UCS4Char;
  public
    constructor Create(const aStr: UnicodeString);
    function MoveNext: Boolean;
    property Current: UCS4Char read fCurrent;
  end;

constructor TUCS4CharUnicodeStrEnumerator.Create(const aStr: UnicodeString);
begin
  fStr := aStr;
  fIndex := 0;  // strings are 1-based; MoveNext increments before reading
  fCurrent := 0;
end;

function TUCS4CharUnicodeStrEnumerator.MoveNext: Boolean;
begin
  Inc(fIndex);
  if fIndex > Length(fStr) then
    Exit(False);
  if (Ord(fStr[fIndex]) >= $D800) and (Ord(fStr[fIndex]) <= $DBFF) then begin
    if fIndex < High(fStr) then begin
      if (Ord(fStr[fIndex + 1]) >= $DC00) and (Ord(fStr[fIndex + 1]) <= $DFFF) then begin
        fCurrent := UCS4Char(Ord(fStr[fIndex]) - $D800) shl 10 + UCS4Char(Ord(fStr[fIndex + 1])) - $DC00 + $10000;
        Inc(fIndex);
      end else
        fCurrent := Ord(fStr[fIndex]);
    end else
      fCurrent := Ord(fStr[fIndex]);
  end else
    fCurrent := Ord(fStr[fIndex]);
  Result := True;
end;

operator Enumerator(const aStr: UnicodeString): TUCS4CharUnicodeStrEnumerator;
begin
  Result := TUCS4CharUnicodeStrEnumerator.Create(aStr);
end;

var
  s: UnicodeString;
  u: UCS4Char;
begin
  s := 'Hello, 🌎!';
  for u in s do
    Writeln(HexStr(Ord(u), 8));
end.

=== code end ===
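
The surrogate arithmetic in MoveNext can be checked in isolation. The following sketch uses a hypothetical helper, CombineSurrogates (not part of the RTL), and feeds it the UTF-16 pair for the globe character, $D83C $DF0E:

```pascal
program surrogatedemo;

{$mode objfpc}{$H+}

// Hypothetical helper for illustration: combines a high and a low
// surrogate into one code point. No validation is done here.
function CombineSurrogates(hi, lo: Word): LongInt;
begin
  Result := ((hi - $D800) shl 10) + (lo - $DC00) + $10000;
end;

begin
  // ($3C shl 10) + $30E + $10000 = $1F30E
  Writeln(HexStr(CombineSurrogates($D83C, $DF0E), 8));  // prints 0001F30E
end.
```

The high surrogate contributes the upper 10 bits and the low surrogate the lower 10 bits; adding $10000 moves the result above the Basic Multilingual Plane.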

Regards,
Sven
