Hi Joe,
I was able to run the code you posted and reproduced the exact same
result (Rakudo version 2020.02.1..1 built on MoarVM version
2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit
(e.g. UTF8-C8), but I didn't see any improvement.
Yary has an issue posted regard
I was just posting that.
On 4/24/20, Elizabeth Mattijsen wrote:
>
>
>> On 24 Apr 2020, at 22:03, Joseph Brenner wrote:
>>
>> Thanks, yes I understand Unicode and UTF-8 reasonably well.
>>
>>> So Rakudo has to read the next codepoint to make sure that it isn't a
>>> combining codepoint.
>>
>>> It
Another version of my test code, checking .tell throughout:
use v6;
use Test;
my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str = "ABCDEFGHI";
test_read_and_read_again($unicha
> On 24 Apr 2020, at 22:03, Joseph Brenner wrote:
>
> Thanks, yes I understand Unicode and UTF-8 reasonably well.
>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>
>> It is probably faking up the reads to look right when reading ASCII, but
Thanks, yes I understand Unicode and UTF-8 reasonably well.
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
I think it'd be the o
In UTF-8, characters can be 1 to 4 bytes long.
UTF-8 was designed so that 7-bit ASCII is a subset of it.
Any 8-bit byte that has its most significant bit set cannot be ASCII.
So multi-byte codepoints have the most significant bit set in every one
of their bytes.
The first byte can tell you the number of byt
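
As a small illustration of the byte layout described above, here is a
minimal Raku sketch. The helper name utf8-seq-length is made up for this
example and is not part of Rakudo; it only inspects the high bits of the
first byte of a sequence.

```raku
use v6;

# Hypothetical helper: length of a UTF-8 sequence, judged from its
# first byte alone. Returns 0 for a continuation byte (10xxxxxx) or
# for a byte that cannot start a sequence.
sub utf8-seq-length(Int $b --> Int) {
    return 1 if $b +& 0x80 == 0x00;   # 0xxxxxxx: plain ASCII
    return 2 if $b +& 0xE0 == 0xC0;   # 110xxxxx
    return 3 if $b +& 0xF0 == 0xE0;   # 1110xxxx
    return 4 if $b +& 0xF8 == 0xF0;   # 11110xxx
    return 0;
}

for "A", "\x[1200]", "\x[1F600]" -> $ch {
    my $bytes = $ch.encode('utf8');   # raw UTF-8 bytes of the character
    printf "U+%04X: %d byte(s), lead byte 0x%02X implies %d\n",
        $ch.ord, $bytes.elems, $bytes[0], utf8-seq-length($bytes[0]);
}
```

The three sample characters were picked to cover the 1-, 3- and 4-byte
cases; any other characters would do.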
I thought that doing a readchars on a filehandle, seeking backwards
by the width of the char in bytes, and then doing another read
would always get the same character. That works for ASCII-range
characters (1 byte in UTF-8 encoding) but not for multi-byte "wide"
characters (commonly 3 bytes in UTF-8).
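
A stripped-down sketch of that experiment might look like the following.
The scratch file name seek_demo.txt and the variable names are
placeholders for this example; the script only reports what it sees, and
the result of the final comparison may depend on the Rakudo version.

```raku
use v6;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file   = "$tmpdir/seek_demo.txt";      # placeholder scratch file
spurt $file, "\x[1200]\x[2D80]\x[4DFC]";   # three 3-byte characters

my $fh = open $file, :r;

# Read one character and work out how many bytes it takes in UTF-8.
my $first = $fh.readchars(1);
my $width = $first.encode('utf8').bytes;
say "first read: U+{$first.ord.base(16)}, width $width, tell {$fh.tell}";

# Seek back by that width and read one character again.
$fh.seek(-$width, SeekFromCurrent);
say "after seek: tell {$fh.tell}";
my $again = $fh.readchars(1);

say $first eq $again ?? "same character" !! "different character";
$fh.close;
```

Printing .tell after each step makes it easier to see whether the handle's
position matches the number of bytes one would expect to have been consumed.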
Th