On 7/4/23 09:12, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal
wrote:
For console apps that use the Unicode KVM video unit, I've introduced two
functions for determining the display width of a Unicode string in the video
unit:
function Ex
> On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal
> wrote:
>
> For console apps that use the Unicode KVM video unit, I've introduced two
> functions for determining the display width of a Unicode string in the video
> unit:
>
> function ExtendedGraphemeClusterDisplayWidth(const
On 7/4/23 08:08, Nikolay Nikolov wrote:
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
You know you're right, with properly enclosed patterns you can
capture everything inside and it works. You won't know if you had
unicode in your st
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
You know you're right, with properly enclosed patterns you can capture
everything inside and it works. You won't know if you had unicode in your
string or not though but that depends on wha
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
>
> You know you're right, with properly enclosed patterns you can capture
> everything inside and it works. You won't know if you had unicode in your
> string or not though but that depends on what's being parsed and if you care
> or not (I
> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal
> wrote:
>
> But you just don't need to do this, in order to tokenize Pascal. The
> beginning and the end of the string literal is the apostrophe, which is
> ASCII. The bear is a sequence of UTF-8 code units (opaque to the compil
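The point in the reply above can be sketched in a few lines. This is only an illustration, not code from the compiler: `SkipStringLiteral` is a hypothetical helper, and it assumes the source is valid UTF-8, where every byte of a multi-byte sequence is $80 or higher and so can never be mistaken for the ASCII apostrophe ($27).

```pascal
program LiteralSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: given the index of an opening apostrophe, return the
// index just past the closing one, or 0 if the literal is unterminated.
function SkipStringLiteral(const src: RawByteString; start: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  if (start < 1) or (start > Length(src)) or (src[start] <> '''') then
    Exit;
  i := start + 1;
  while i <= Length(src) do
  begin
    if src[i] <> '''' then
      Inc(i)                          // any other byte, UTF-8 included, is opaque
    else if (i < Length(src)) and (src[i + 1] = '''') then
      Inc(i, 2)                       // '' inside a literal is an escaped apostrophe
    else
      Exit(i + 1);                    // closing apostrophe found
  end;
end;

var
  s: RawByteString;
begin
  // x:='<bear>'; with the bear (U+1F43B) written as its four UTF-8 bytes
  s := 'x:=''' + #$F0#$9F#$90#$BB + ''';';
  WriteLn(SkipStringLiteral(s, 4));   // index of the ';' that follows the literal
end.
```

The scanner never decodes anything; it only compares bytes against the apostrophe, which is why the tokenizer does not need to understand Unicode here.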
On 7/4/23 07:45, Nikolay Nikolov wrote:
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal
wrote:
For what grammar? What characters are allowed in a token? For
example, Free Pascal also has a parser/tokenizer, but since Pascal
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal
wrote:
For what grammar? What characters are allowed in a token? For example, Free
Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only,
it doesn't need
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal
> wrote:
>
> For what grammar? What characters are allowed in a token? For example, Free
> Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only,
> it doesn't need to understand Unicode characters, so it wor
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal
wrote:
You need to understand all these terms and know exactly what you need to do.
E.g. are you dealing with keyboard input, are you dealing with the low level
parts of text dis
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal
> wrote:
>
> You need to understand all these terms and know exactly what you need to do.
> E.g. are you dealing with keyboard input, are you dealing with the low level
> parts of text display, are you searching for something in th
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal
wrote:
function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
CodePointLen: longint;
CodePoint: longword;
begin
Result:=0;
while
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal
> wrote:
>
> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
> // returns the number of codepoints
> var
> CodePointLen: longint;
> CodePoint: longword;
> begin
> Result:=0;
> while (ByteCount>0) do begin
>inc(Resul
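The quoted function is cut off by the archive, so here is a separate minimal sketch of the same idea (counting code points in a UTF-8 buffer). It is not Mattias's code: `CountCodePoints` is a hypothetical name, and the sketch assumes the input is valid UTF-8, so it simply counts every byte that is not a continuation byte (10xxxxxx).

```pascal
program CountSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: number of code points in ByteCount bytes of valid UTF-8.
function CountCodePoints(p: PAnsiChar; ByteCount: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 0 to ByteCount - 1 do
    if (Ord(p[i]) and $C0) <> $80 then  // skip 10xxxxxx continuation bytes
      Inc(Result);
end;

const
  // 'a', U+1F496 (F0 9F 92 96), 'b'
  Sample: array[0..5] of Byte = ($61, $F0, $9F, $92, $96, $62);
begin
  // 3 code points in 6 bytes
  WriteLn(CountCodePoints(PAnsiChar(@Sample[0]), Length(Sample)));
end.
```

If broken or malicious input is possible, this shortcut is not enough and each sequence has to be validated instead.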
On Mon, 3 Jul 2023 17:18:56 +0700
Hairy Pixels via fpc-pascal wrote:
>[...]
> > First of all: Is it valid UTF-8 or do you have to check for broken
> > or malicious sequences?
>
> If they give the parser broken files that's their problem they need
> to fix? the user has control over the file so
On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote:
Right now I've just read the file into an AnsiString and am indexing it assuming a
fixed character size, which of course breaks if multi-byte characters exist.
I also need to know if I come across something like \u1F496 I need to conv
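A rough sketch of reading such an escape into a code point follows. The escape syntax is an assumption (`\u` followed by one to six hex digits), since the actual grammar is not shown in the thread, and `ParseUnicodeEscape` is a hypothetical name.

```pascal
program EscapeSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: if s[idx] starts a '\u' escape, read up to six hex
// digits into cp and advance idx past them. Returns False otherwise.
function ParseUnicodeEscape(const s: AnsiString; var idx: SizeInt;
  out cp: LongWord): Boolean;
var
  digits: Integer;
  d: LongWord;
begin
  cp := 0;
  digits := 0;
  Result := False;
  if (idx < 1) or (idx + 1 > Length(s)) or (s[idx] <> '\') or (s[idx + 1] <> 'u') then
    Exit;
  Inc(idx, 2);
  while (idx <= Length(s)) and (digits < 6) do
  begin
    case s[idx] of
      '0'..'9': d := Ord(s[idx]) - Ord('0');
      'a'..'f': d := Ord(s[idx]) - Ord('a') + 10;
      'A'..'F': d := Ord(s[idx]) - Ord('A') + 10;
    else
      Break;                           // first non-hex character ends the escape
    end;
    cp := (cp shl 4) or d;
    Inc(digits);
    Inc(idx);
  end;
  Result := (digits > 0) and (cp <= $10FFFF);
end;

var
  i: SizeInt;
  cp: LongWord;
begin
  i := 2;                              // position of the backslash in the sample
  if ParseUnicodeEscape('x\u1F496!', i, cp) then
    WriteLn('U+', HexStr(Int64(cp), 5), ' next index: ', i);
end.
```

The resulting code point still has to be encoded as UTF-8 bytes before it can be stored in an AnsiString.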
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal
> wrote:
>
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode unicode literals in the grammar.
>
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequenc
Hi Ryan,
I’ve created the attached unit, which takes a code point and returns the UTF-8 char as a string. It’s based on the Wikipedia article on UTF8:
UTF-8 encodes code points in one to four bytes, depending on the value of the code point. The x characters are replaced by the bits of the code point:
This t
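The attached unit is not part of the archive, so the following is an independent sketch of the encoding rule just described, not Jer's code. `CodePointToUTF8` is a hypothetical name; it returns an empty string for surrogates (U+D800..U+DFFF) and for values above U+10FFFF.

```pascal
program EncodeSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes.
function CodePointToUTF8(cp: LongWord): RawByteString;
begin
  Result := '';
  if cp <= $7F then
  begin
    SetLength(Result, 1);                            // 0xxxxxxx
    Result[1] := Chr(cp);
  end
  else if cp <= $7FF then
  begin
    SetLength(Result, 2);                            // 110xxxxx 10xxxxxx
    Result[1] := Chr($C0 or (cp shr 6));
    Result[2] := Chr($80 or (cp and $3F));
  end
  else if (cp <= $FFFF) and ((cp < $D800) or (cp > $DFFF)) then
  begin
    SetLength(Result, 3);                            // 1110xxxx 10xxxxxx 10xxxxxx
    Result[1] := Chr($E0 or (cp shr 12));
    Result[2] := Chr($80 or ((cp shr 6) and $3F));
    Result[3] := Chr($80 or (cp and $3F));
  end
  else if (cp > $FFFF) and (cp <= $10FFFF) then
  begin
    SetLength(Result, 4);                            // 11110xxx + 3 continuation bytes
    Result[1] := Chr($F0 or (cp shr 18));
    Result[2] := Chr($80 or ((cp shr 12) and $3F));
    Result[3] := Chr($80 or ((cp shr 6) and $3F));
    Result[4] := Chr($80 or (cp and $3F));
  end;
end;

var
  s: RawByteString;
  i: Integer;
begin
  s := CodePointToUTF8($1F496);
  for i := 1 to Length(s) do
    Write(IntToHex(Ord(s[i]), 2), ' ');              // F0 9F 92 96
  WriteLn;
end.
```

The result type is RawByteString so that no code page conversion touches the bytes on assignment.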
On Mon, 3 Jul 2023 15:27:10 +0700
Hairy Pixels via fpc-pascal wrote:
>[...]
> I was just curious how ChatGPT's implementation compared to other
> programmers.
Apparently the quality is often terrible. But it can be useful.
> What I'm really trying to do is improve a parser so it can read UTF-8
> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal
> wrote:
>
> I wonder, is this thread about testing ChatGPT or do you want to
> implement something useful?
> There are already plenty of optimized UTF-8 functions in the FPC and
> Lazarus sources. Maybe too many, and you have trou
On Mon, 3 Jul 2023 12:01:11 +0700
Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal
> > wrote:
> >
> > Useless array of.
> > And it does not return the bytecount.
>
> it's an open array so what's the problem?
>[...]
> > Wrong for byteCount=1
On Mon, 3 Jul 2023 14:12:03 +0700
Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal
> > wrote:
> >
> > No - in this case, the "header" is the highest bit of that byte
> > being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this
>
On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal
wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal
>> wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte being 0.
>
>Oh it's the header BIT. Admittedly I don't understand how this function
>retur
> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal
> wrote:
>
> No - in this case, the "header" is the highest bit of that byte being 0.
Oh it's the header BIT. Admittedly I don't understand how this function returns
the highest bit using that case, which I think he was suggesting.
f
On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal
wrote:
>> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal
>> wrote:
>>
>> No, the header of a codepoint to figure out the length.
>
>so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and
>1 for the
> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal
> wrote:
>
> No, the header of a codepoint to figure out the length.
so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and
1 for the character?
ASCII #100 is the same character in UTF-8 but it needs a
On Mon, 3 Jul 2023 11:58:33 +0700
Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal
> > wrote:
> >
> > There is a header byte.
> >
> > It depends, if you want to check for invalid UTF-8 sequences.
> >
> > From LazUTF8:
> >
> > function UTF8Co
> On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal
> wrote:
>
> Useless array of.
> And it does not return the bytecount.
it's an open array so what's the problem?
>
>> var
>> i: Integer;
>> byteCount: Integer;
>> begin
>> // Number of bytes required to represent the Unicode
> On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal
> wrote:
>
> There is a header byte.
>
> It depends, if you want to check for invalid UTF-8 sequences.
>
> From LazUTF8:
>
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
> case p^ of
>#0..#191 : Result := 1
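The LazUTF8 excerpt is truncated here, so rather than guess at the rest of it, the following is a separate sketch of how the lead ("header") byte determines the sequence length. `LeadByteLength` is a hypothetical name and, unlike the quoted function, it returns 0 for bytes that cannot start a sequence.

```pascal
program LeadByteSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with b, or 0 if b is a continuation byte or not a valid lead byte.
function LeadByteLength(b: Byte): Integer;
begin
  if (b and $80) = $00 then
    Result := 1          // 0xxxxxxx: plain ASCII, highest bit is 0
  else if (b and $E0) = $C0 then
    Result := 2          // 110xxxxx
  else if (b and $F0) = $E0 then
    Result := 3          // 1110xxxx
  else if (b and $F8) = $F0 then
    Result := 4          // 11110xxx
  else
    Result := 0;         // 10xxxxxx continuation byte, or invalid
end;

begin
  WriteLn(LeadByteLength($64));  // 1: ASCII #100 is a complete character by itself
  WriteLn(LeadByteLength($F0));  // 4: first byte of U+1F496
  WriteLn(LeadByteLength($92));  // 0: continuation byte
end.
```

This also answers the question raised elsewhere in the thread: an ASCII character needs no separate header byte, because its highest bit being 0 is the header.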
On Mon, 3 Jul 2023 08:29:11 +0700
Hairy Pixels via fpc-pascal wrote:
> > On Jul 2, 2023, at 11:16 PM, Jer Haan wrote:
> >
> > This table is copied from Wikipedia. Hope it’s useful
> > for you. If you improve the code pls let me know.
>
> This is perfect, thanks! Much more complicated than I th
On Mon, 3 Jul 2023 09:34:10 +0700
Hairy Pixels via fpc-pascal wrote:
>[...]
> Ok today I just tried to ask ChatGPT and got an answer. I must have
> asked the wrong thing yesterday but it got it right today (with one
> syntax error using an inline "var" in the code section for some
> reason).
>
> On Jul 3, 2023, at 12:20 AM, Nikolay Nikolov via fpc-pascal
> wrote:
>
> There's no such thing as "unicode scalar" in Unicode terminology:
>
> https://unicode.org/glossary/
I got it from here
https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharact
> On Jul 2, 2023, at 11:16 PM, Jer Haan wrote:
>
> This table is copied from Wikipedia. Hope it’s useful for you.
> If you improve the code pls let me know.
>
This is perfect, thanks! Much more complicated than I thought.
I'm curious now, if you were going the other direction and parsing a s
On 7/2/23 20:38, Martin Frb via fpc-pascal wrote:
On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote:
On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
I'm interested in parsing unicode scalars (I think they're called)
to byte sized values but I'm not sure where to start. First thing
On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote:
On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
I'm interested in parsing unicode scalars (I think they're called) to
byte sized values but I'm not sure where to start. First thing I did
was choose the unicode scalar U+1F496 (💖).
On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
I'm interested in parsing unicode scalars (I think they're called) to byte
sized values but I'm not sure where to start. First thing I did was choose the
unicode scalar U+1F496 (💖).
There's no such thing as "unicode scalar" in Unicode termin
I'm interested in parsing unicode scalars (I think they're called) to byte
sized values but I'm not sure where to start. First thing I did was choose the
unicode scalar U+1F496 (💖).
Next I cheated and asked ChatGPT. :) Amazingly from my question it was able to
tell me the scalar is comprised of t