Thanks, yes I understand Unicode and UTF-8 reasonably well.

> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.

I think it'd be the other way around... the idea here would be that
it's doing an extra read behind the scenes just in case there are
combining chars involved. So you're figuring there's some confusion
between the actual point in the file that's being read and the
abstraction that readchars is supplying?

On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
> In UTF8 characters can be 1 to 4 bytes long.
>
> UTF8 was designed so that 7-bit ASCII is a subset of it.
>
> Any 8bit byte that has its most significant bit set cannot be ASCII.
> So multi-byte codepoints have the most significant bit set for all of the
> bytes.
> The first byte can tell you the number of bytes that follow it.
>
> That is how a single codepoint is stored.
>
> A character can be made of several codepoints.
>
>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>     "é"
>
> So Rakudo has to read the next codepoint to make sure that it isn't a
> combining codepoint.
>
> It is probably faking up the reads to look right when reading ASCII, but
> failing to do that for wider codepoints.
>
> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com> wrote:
>
>> I thought that doing a readchars on a filehandle, seeking backwards
>> the width of the char in bytes and then doing another read
>> would always get the same character. That works for ASCII-range
>> characters (1 byte in UTF-8 encoding) but not multi-byte "wide"
>> characters (commonly 3 bytes in UTF-8).
>>
>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> not ASCII-range ones?
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
>> my $ascii_str = "ABCDEFGHI";
>>
>> subtest {
>>     my $nudge = 3;
>>     test_read_and_read_again($unichar_str, $file, $nudge);
>> }, "Wide unicode chars: $unichar_str";
>>
>> subtest {
>>     my $nudge = 0;
>>     test_read_and_read_again($ascii_str, $file, $nudge);
>> }, "Ascii-range chars: $ascii_str";
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>     spurt $file, $str;
>>     my $fh = $file.IO.open;
>>     $fh.readchars(2);  # skip a few
>>     my $chr_1 = $fh.readchars(1);
>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>     my $step_back = $width + $nudge;
>>     $fh.seek: -$step_back, SeekFromCurrent;
>>     my $chr_2 = $fh.readchars(1);
>>     is( $chr_1, $chr_2,
>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
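
One way to look at that gap directly might be to print $fh.tell after
each readchars(1) and compare it with the byte width of the character
that was just returned. The following is only a diagnostic sketch, not
a fix: the scratch file name is made up, the strings are shortened
versions of the ones in the test above, and it assumes nothing about
which numbers Rakudo will actually print.

```raku
use v6;

# Hypothetical scratch file; any writable path would do.
my $file = $*TMPDIR.add('readchars_tell_scratch.txt');

for "ABCDEF", "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]" -> $str {
    spurt $file, $str;
    my $fh = $file.open;
    for 1..3 {
        my $chr   = $fh.readchars(1);
        my $width = $chr.encode('UTF-8').bytes;
        # .tell is the position the handle reports, in bytes from the start
        say "read {$chr}, {$width} byte(s) in UTF-8, tell is now {$fh.tell}";
    }
    $fh.close;
}
unlink $file;
```

If the reported position always advances by exactly the width of the
character just read, the handle itself is keeping the bookkeeping
straight and the discrepancy would have to come from the relative seek.
If it runs ahead for the wide characters, that would fit the
read-ahead-for-combining-chars explanation.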