On Saturday, 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:
> > https://github.com/rakudo/rakudo/issues/3461
> >
> > I know it might be far-fetched, but what if your UTF-8 issue and
> > Yary's UTF-16 issue were related
>
> Well, an issue with handling combining characters could easily affect
> both, nothing about it is specific to one encoding. Yary's issue
> doesn't have to do with reading from disk though, he's just looking at
> the raw bytes the encoding generates.
>
> On 4/24/20, William Michels <w...@caa.columbia.edu> wrote:
> > Hi Joe,
> >
> > I was able to run the code you posted and reproduced the exact same
> > result (Rakudo version 2020.02.1.0000.1 built on MoarVM version
> > 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a
> > bit (e.g. UTF8-C8), but I didn't see any improvement.
> >
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> > strings:
> >
> > https://github.com/rakudo/rakudo/issues/3461
> >
> > I know it might be far-fetched, but what if your UTF-8 issue and
> > Yary's UTF-16 issue were related? It would be nice to kill two birds
> > with one stone.
> >
> > Best Regards, Bill.
> >
> > On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <doom...@gmail.com> wrote:
> >> Another version of my test code, checking .tell throughout:
> >>
> >> use v6;
> >> use Test;
> >>
> >> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> my $file = "$tmpdir/scratch_file.txt";
> >> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
> >> my $ascii_str = "ABCDEFGHI";
> >>
> >> test_read_and_read_again($unichar_str, $file, 3);
> >> test_read_and_read_again($ascii_str, $file, 0);
> >>
> >> # write given string to file, then read the third character twice and check
> >> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >>     spurt $file, $str;
> >>     my $fh = $file.IO.open;
> >>     printf "%d: just opened\n", $fh.tell;
> >>     $fh.readchars(2);  # skip a few
> >>     printf "%d: after skipping 2\n", $fh.tell;
> >>     my $chr_1 = $fh.readchars(1);
> >>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
> >>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >>     my $step_back = $width + $nudge;
> >>     $fh.seek: -$step_back, SeekFromCurrent;
> >>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
> >>     my $chr_2 = $fh.readchars(1);
> >>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
> >>     is( $chr_1, $chr_2,
> >>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> }
> >>
> >>
> >> The output looks like so:
> >>
> >> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> >> 0: just opened
> >> 9: after skipping 2
> >> 12: after reading 3rd: ䷼
> >> 6: after seeking back 6
> >> 12: after re-reading 3rd: ䷼
> >> ok 1 - read, seek back, and read again gets same char with nudge of 3
> >> 0: just opened
> >> 2: after skipping 2
> >> 3: after reading 3rd: C
> >> 2: after seeking back 1
> >> 3: after re-reading 3rd: C
> >> ok 2 - read, seek back, and read again gets same char with nudge of 0
> >>
> >> It's really hard to see what I should do if I really wanted to
> >> intermix readchars and seeks like this... I'd need to check the range
> >> of the codepoint to see how far I need to seek to get where I expect
> >> to be.
> >>
> >> On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
> >> > Thanks, yes I understand unicode and utf-8 reasonably well.
> >> >
> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> combining codepoint.
> >> >>
> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> but failing to do that for wider codepoints.
> >> >
> >> > I think it'd be the other way around... the idea here would be it's
> >> > doing an extra readchar behind the scenes just in case there's
> >> > combining chars involved-- so you're figuring there's some confusion
> >> > about the actual point in the file that's being read and the
> >> > abstraction that readchars is supplying?
> >> >
> >> > On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
> >> >> In UTF8 characters can be 1 to 4 bytes long.
> >> >>
> >> >> UTF8 was designed so that 7-bit ASCII is a subset of it.
> >> >>
> >> >> Any 8bit byte that has its most significant bit set cannot be ASCII.
> >> >> So multi-byte codepoints have the most significant bit set for all of
> >> >> the bytes.
> >> >> The first byte can tell you the number of bytes that follow it.
> >> >>
> >> >> That is how a single codepoint is stored.
> >> >>
> >> >> A character can be made of several codepoints.
> >> >>
> >> >> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> >> >> "é"
> >> >>
> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> combining codepoint.
> >> >>
> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> but failing to do that for wider codepoints.
> >> >>
> >> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com> wrote:
> >> >>> I thought that doing a readchars on a filehandle, seeking backwards
> >> >>> the width of the char in bytes and then doing another read
> >> >>> would always get the same character. That works for ascii-range
> >> >>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> >> >>> characters (commonly 3-bytes in utf-8).
> >> >>>
> >> >>> The question then, is why do I need a $nudge of 3 for wide chars, but
> >> >>> not ascii-range ones?
> >> >>>
> >> >>> use v6;
> >> >>> use Test;
> >> >>>
> >> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> >>> my $file = "$tmpdir/scratch_file.txt";
> >> >>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]"; # ሀⶀ䷼ꪪⲤⲎ
> >> >>> my $ascii_str = "ABCDEFGHI";
> >> >>>
> >> >>> subtest {
> >> >>>     my $nudge = 3;
> >> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
> >> >>> }, "Wide unicode chars: $unichar_str";
> >> >>>
> >> >>> subtest {
> >> >>>     my $nudge = 0;
> >> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
> >> >>> }, "Ascii-range chars: $ascii_str";
> >> >>>
> >> >>> # write given string to file, then read the third character twice and check
> >> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> >>>     spurt $file, $str;
> >> >>>     my $fh = $file.IO.open;
> >> >>>     $fh.readchars(2);  # skip a few
> >> >>>     my $chr_1 = $fh.readchars(1);
> >> >>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >> >>>     my $step_back = $width + $nudge;
> >> >>>     $fh.seek: -$step_back, SeekFromCurrent;
> >> >>>     my $chr_2 = $fh.readchars(1);
> >> >>>     is( $chr_1, $chr_2,
> >> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> >>> }

I don't think the UTF-16 issue is related.

On the topic of readchars: can someone tell me why the result of that
readchars script is unexpected? Maybe I missed part of the conversation,
but if someone can summarize the expected vs. actual result, that would
be great.