Re: readchars, seek back, and readchars again

2020-04-24 Thread William Michels via perl6-users
Hi Joe, I was able to run the code you posted and reproduced the exact same result (Rakudo version 2020.02.1..1 built on MoarVM version 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit (e.g. UTF8-C8), but I didn't see any improvement. Yary has an issue posted regard

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
I was just posting that.

On 4/24/20, Elizabeth Mattijsen wrote:
>
>
>> On 24 Apr 2020, at 22:03, Joseph Brenner wrote:
>>
>> Thanks, yes I understand unicode and utf-8 reasonably well.
>>
>>> So Rakudo has to read the next codepoint to make sure that it isn't a
>>> combining codepoint.
>>
>>> It

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Another version of my test code, checking .tell throughout:

    use v6;
    use Test;

    my $tmpdir = IO::Spec::Unix.tmpdir;
    my $file = "$tmpdir/scratch_file.txt";
    my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
    my $ascii_str = "ABCDEFGHI";
    test_read_and_read_again($unicha

Re: readchars, seek back, and readchars again

2020-04-24 Thread Elizabeth Mattijsen
> On 24 Apr 2020, at 22:03, Joseph Brenner wrote:
>
> Thanks, yes I understand unicode and utf-8 reasonably well.
>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>
>> It is probably faking up the reads to look right when reading ASCII, but

Re: readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
Thanks, yes I understand unicode and utf-8 reasonably well.

> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.

> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.

I think it'd be the o

Re: readchars, seek back, and readchars again

2020-04-24 Thread Brad Gilbert
In UTF-8, characters can be 1 to 4 bytes long. UTF-8 was designed so that 7-bit ASCII is a subset of it. Any 8-bit byte that has its most significant bit set cannot be ASCII. So multi-byte codepoints have the most significant bit set for all of the bytes. The first byte can tell you the number of byt
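
The point about the first byte can be sketched in Raku roughly as follows. This is only an illustration of the standard UTF-8 lead-byte patterns, not code from the thread; the sub name utf8-seq-length is made up for the example, and it does no validation beyond looking at the high bits of one byte.

```raku
use v6;

# Hypothetical helper (not from the thread): number of bytes in a UTF-8
# sequence, judged from the high bits of its first byte alone.
sub utf8-seq-length(Int $lead --> Int) {
    return 1 if ($lead +& 0b1000_0000) == 0b0000_0000;  # 0xxxxxxx: ASCII
    return 2 if ($lead +& 0b1110_0000) == 0b1100_0000;  # 110xxxxx
    return 3 if ($lead +& 0b1111_0000) == 0b1110_0000;  # 1110xxxx
    return 4 if ($lead +& 0b1111_1000) == 0b1111_0000;  # 11110xxx
    return 0;  # continuation byte (10xxxxxx) or not a valid lead byte
}

# Compare the prediction with the length of the actual encoding.
for "A", "\x[E9]", "\x[1200]", "\x[1F600]" -> $char {
    my $bytes = $char.encode('utf8');
    printf "U+%04X: lead byte 0x%02X, predicted %d, actual %d\n",
        $char.ord, $bytes[0], utf8-seq-length($bytes[0]), $bytes.elems;
}
```

Every byte after the first in a multi-byte sequence has the form 10xxxxxx, which is why the helper returns 0 for those: they can never start a character.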

readchars, seek back, and readchars again

2020-04-24 Thread Joseph Brenner
I thought that doing a readchars on a filehandle, seeking backwards the width of the char in bytes and then doing another read would always get the same character. That works for ASCII-range characters (1 byte in UTF-8 encoding) but not multi-byte "wide" characters (commonly 3 bytes in UTF-8). Th
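
A minimal sketch of the read, seek back, read again sequence described above, for anyone who wants to try it. It is not the original test script: the scratch file name is made up, and the script only prints what .tell and the two reads report rather than assuming any particular result.

```raku
use v6;

# Hypothetical scratch file name; any writable path would do.
my $file = $*TMPDIR.add('readchars_seek_demo.txt');
$file.spurt("\x[1200]\x[2D80]\x[4DFC]");    # three characters, 3 bytes each in UTF-8

my $fh = $file.open(:r);                    # default encoding is utf8

my $first = $fh.readchars(1);
my $width = $first.encode('utf8').bytes;    # width of that character in bytes
say "after first read: tell = { $fh.tell }";

$fh.seek(-$width, SeekFromCurrent);         # step back by the character's width
say "after seek back:  tell = { $fh.tell }";

my $second = $fh.readchars(1);
say "first = $first, second = $second, same = { $first eq $second }";

$fh.close;
$file.unlink;
```

Swapping the string for "ABC" gives the ASCII case for comparison.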