Re: [fpc-pascal] Parse unicode scalar

Mattias Gaertner via fpc-pascal Sun, 02 Jul 2023 22:05:03 -0700

On Mon, 3 Jul 2023 11:58:33 +0700
Hairy Pixels via fpc-pascal <fpc-pascal@lists.freepascal.org> wrote:


> > On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal
> > <fpc-pascal@lists.freepascal.org> wrote:
> > 
> > There is a header byte.
> > 
> > It depends on whether you want to check for invalid UTF-8 sequences.
> > 
> > From LazUTF8:
> > 
> > function UTF8CodepointSizeFast(p: PChar): integer;
> > begin
> >  case p^ of
> >    #0..#191   : Result := 1;
> >    #192..#223 : Result := 2;
> >    #224..#239 : Result := 3;
> >    #240..#247 : Result := 4;
> >    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
> >  end;
> > end;
> 
> This is a header for the file?

No, the header byte of a codepoint. It tells you how many bytes that
codepoint occupies.

> Does that mean the file itself must
> have uniform character sizes?

No.
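
A minimal sketch of walking a buffer that mixes 1 to 4 byte codepoints,
assuming the input is valid UTF-8. CountCodepoints is a made-up name for
illustration and is not part of LazUTF8; UTF8CodepointSizeFast is copied
from the function quoted above so the program compiles on its own:

```pascal
program WalkUtf8;
{$mode objfpc}{$H+}

// Copy of the LazUTF8 function quoted above.
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1;
  end;
end;

// Hypothetical helper: counts codepoints, assuming s holds valid UTF-8.
function CountCodepoints(const s: string): integer;
var
  p, pEnd: PChar;
begin
  Result := 0;
  if s = '' then exit;
  p := PChar(s);
  pEnd := p + Length(s);
  while p < pEnd do
  begin
    // The header byte alone decides how far to advance.
    inc(p, UTF8CodepointSizeFast(p));
    inc(Result);
  end;
end;

begin
  // 'a' (1 byte), U+00E4 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  writeln(CountCodepoints('a'#$C3#$A4#$E2#$82#$AC#$F0#$9F#$98#$80));
end.
```

The string is 10 bytes long but holds 4 codepoints. A truncated or
invalid sequence could push p past pEnd, so real code would need a
bounds check.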

> I thought the idea was to read the file
> one byte at a time, but I don't understand how you would know whether a 1
> byte character (like ASCII) was part of a 4 byte character or not.

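ASCII is #0..#127, and each of those bytes encodes the same character in
UTF-8. Every byte of a multi-byte sequence is #128 or above, so an ASCII
byte is never part of a longer character.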
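
A rough sketch of how each byte can be classified by its value alone.
Utf8ByteKind and TUtf8ByteKind are hypothetical names, not LazUTF8 API,
and the lead-byte range simply follows the ranges in the quoted function
(a strict validator would also reject #192, #193 and #245..#247):

```pascal
program ByteKinds;
{$mode objfpc}{$H+}

type
  TUtf8ByteKind = (bkAscii, bkContinuation, bkLead, bkInvalid);

const
  KindName: array[TUtf8ByteKind] of string =
    ('ascii', 'continuation', 'lead', 'invalid');

// Hypothetical helper: classifies a single byte by its high bits.
function Utf8ByteKind(c: Char): TUtf8ByteKind;
begin
  case c of
    #0..#127   : Result := bkAscii;        // 0xxxxxxx
    #128..#191 : Result := bkContinuation; // 10xxxxxx
    #192..#247 : Result := bkLead;         // 110xxxxx, 1110xxxx, 11110xxx
    else Result := bkInvalid;              // 11111xxx
  end;
end;

var
  s: string;
  i: integer;
begin
  s := 'A'#$E2#$82#$AC'B';  // 'A', U+20AC (3 bytes), 'B'
  for i := 1 to Length(s) do
    writeln(i, ': ', KindName[Utf8ByteKind(s[i])]);
end.
```

For this input the bytes come out as ascii, lead, continuation,
continuation, ascii, so the two ASCII bytes are unambiguous even when
read one byte at a time.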

Mattias

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
