Hi Joe,

I was able to run the code you posted and reproduce the exact same result (Rakudo version 2020.02.1.0000.1 built on MoarVM version 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit (e.g. UTF8-C8), but I didn't see any improvement.

Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:

https://github.com/rakudo/rakudo/issues/3461

I know it might be far-fetched, but what if your UTF-8 issue and Yary's UTF-16 issue were related? It would be nice to kill two birds with one stone.

Best Regards, Bill.

On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <doom...@gmail.com> wrote:
>
> Another version of my test code, checking .tell throughout:
>
> use v6;
> use Test;
>
> my $tmpdir = IO::Spec::Unix.tmpdir;
> my $file = "$tmpdir/scratch_file.txt";
> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
> my $ascii_str = "ABCDEFGHI";
>
> test_read_and_read_again($unichar_str, $file, 3);
> test_read_and_read_again($ascii_str, $file, 0);
>
> # write given string to file, then read the third character twice and check
> sub test_read_and_read_again($str, $file, $nudge = 0) {
>     spurt $file, $str;
>     my $fh = $file.IO.open;
>     printf "%d: just opened\n", $fh.tell;
>     $fh.readchars(2);  # skip a few
>     printf "%d: after skipping 2\n", $fh.tell;
>     my $chr_1 = $fh.readchars(1);
>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>     my $step_back = $width + $nudge;
>     $fh.seek: -$step_back, SeekFromCurrent;
>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
>     my $chr_2 = $fh.readchars(1);
>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
>     is( $chr_1, $chr_2,
>         "read, seek back, and read again gets same char with nudge of $nudge" );
> }
>
>
> The output looks like so:
>
> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> 0: just opened
> 9: after skipping 2
> 12: after reading 3rd: ䷼
> 6: after seeking back 6
> 12: after re-reading 3rd: ䷼
> ok 1 - read, seek back, and read again gets same char with nudge of 3
> 0: just opened
> 2: after skipping 2
> 3: after reading 3rd: C
> 2: after seeking back 1
> 3: after re-reading 3rd: C
> ok 2 - read, seek back, and read again gets same char with nudge of 0
>
> It's really hard to see what I should do if I really wanted to
> intermix readchars and seeks like this... I'd need to check the range
> of the codepoint to see how far I need to seek to get where I expect
> to be.
>
>
>
> On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
> > Thanks, yes I understand unicode and utf-8 reasonably well.
> >
> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> combining codepoint.
> >
> >> It is probably faking up the reads to look right when reading ASCII, but
> >> failing to do that for wider codepoints.
> >
> > I think it'd be the other way around... the idea here would be it's
> > doing an extra readchar behind the scenes just in case there's
> > combining chars involved-- so you're figuring there's some confusion
> > about the actual point in the file that's being read and the
> > abstraction that readchars is supplying?
> >
> >
> > On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
> >> In UTF8 characters can be 1 to 4 bytes long.
> >>
> >> UTF8 was designed so that 7-bit ASCII is a subset of it.
> >>
> >> Any 8bit byte that has its most significant bit set cannot be ASCII.
> >> So multi-byte codepoints have the most significant bit set for all of the
> >> bytes.
> >> The first byte can tell you the number of bytes that follow it.
> >>
> >> That is how a single codepoint is stored.
> >>
> >> A character can be made of several codepoints.
> >>
> >> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> >> "é"
> >>
> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> combining codepoint.
> >>
> >> It is probably faking up the reads to look right when reading ASCII, but
> >> failing to do that for wider codepoints.
> >>
> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com> wrote:
> >>
> >>> I thought that doing a readchars on a filehandle, seeking backwards
> >>> the width of the char in bytes and then doing another read
> >>> would always get the same character. That works for ascii-range
> >>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> >>> characters (commonly 3-bytes in utf-8).
> >>>
> >>> The question then, is why do I need a $nudge of 3 for wide chars, but
> >>> not ascii-range ones?
> >>>
> >>> use v6;
> >>> use Test;
> >>>
> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
> >>> my $file = "$tmpdir/scratch_file.txt";
> >>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
> >>> my $ascii_str = "ABCDEFGHI";
> >>>
> >>> subtest {
> >>>     my $nudge = 3;
> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
> >>> }, "Wide unicode chars: $unichar_str";
> >>>
> >>> subtest {
> >>>     my $nudge = 0;
> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
> >>> }, "Ascii-range chars: $ascii_str";
> >>>
> >>> # write given string to file, then read the third character twice and check
> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >>>     spurt $file, $str;
> >>>     my $fh = $file.IO.open;
> >>>     $fh.readchars(2);  # skip a few
> >>>     my $chr_1 = $fh.readchars(1);
> >>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >>>     my $step_back = $width + $nudge;
> >>>     $fh.seek: -$step_back, SeekFromCurrent;
> >>>     my $chr_2 = $fh.readchars(1);
> >>>     is( $chr_1, $chr_2,
> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >>> }
> >>>
> >>
> >
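
For reference, here is a rough sketch of the lead-byte rule Brad describes ("the first byte can tell you the number of bytes that follow it"), applied to raw bytes instead of a decoding filehandle. Working on bytes avoids having to guess how far readchars has actually advanced the handle, which seems to be what the .tell values above (9 after two 3-byte characters) are showing. The sub names utf8-seq-len and codepoint-at are made up for illustration and are not part of Rakudo. The sketch assumes well-formed UTF-8 and returns single codepoints, so combining sequences are not handled.

```raku
# Length in bytes of a UTF-8 sequence, judged from its first byte.
# (Hypothetical helper; assumes the input is well-formed UTF-8.)
sub utf8-seq-len(Int $lead --> Int) {
    return 1 if ($lead +& 0x80) == 0x00;   # 0xxxxxxx: ASCII range
    return 2 if ($lead +& 0xE0) == 0xC0;   # 110xxxxx
    return 3 if ($lead +& 0xF0) == 0xE0;   # 1110xxxx
    return 4 if ($lead +& 0xF8) == 0xF0;   # 11110xxx
    die "not a UTF-8 lead byte: $lead";
}

# Decode the codepoint at 0-based index $idx by walking the lead bytes.
# (Hypothetical helper; $bytes could equally come from $file.IO.slurp(:bin).)
sub codepoint-at(Blob $bytes, Int $idx --> Str) {
    my $pos = 0;
    $pos += utf8-seq-len($bytes[$pos]) for ^$idx;   # skip $idx codepoints
    $bytes.subbuf($pos, utf8-seq-len($bytes[$pos])).decode('utf8')
}

my $bytes = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]".encode('utf8');
say codepoint-at($bytes, 2);   # the third character of the test string
```

The same walk could drive a handle opened with :bin, where seek and read both count bytes and no character decoding happens in between.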