Re: readchars, seek back, and readchars again

Samantha McVey Tue, 28 Apr 2020 10:48:23 -0700

On Saturday, 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> > strings:
> >  https://github.com/rakudo/rakudo/issues/3461
> >
> >  I know it might be far-fetched, but what if your UTF-8 issue and
> >  Yary's UTF-16 issue were related
> 
> Well, an issue with handling combining characters could easily affect
> both, nothing about it is specific to one encoding. Yary's issue
> doesn't have to do with reading from disk though, he's just looking at
> the raw bytes the encoding generates.
> 
> On 4/24/20, William Michels <w...@caa.columbia.edu> wrote:
> > Hi Joe,
> > 
> > I was able to run the code you posted and reproduced the exact same
> > result (Rakudo version 2020.02.1.0000.1 built on MoarVM version
> > 2020.02.1 implementing Raku 6.d). I tried playing with file encodings
> > a bit (e.g. UTF8-C8), but I didn't see any improvement.
> > 
> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> > strings:
> > 
> > https://github.com/rakudo/rakudo/issues/3461
> > 
> > I know it might be far-fetched, but what if your UTF-8 issue and
> > Yary's UTF-16 issue were related? It would be nice to kill two birds
> > with one stone.
> > 
> > Best Regards, Bill.
> > 
> > On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <doom...@gmail.com> wrote:
> >> Another version of my test code, checking .tell throughout:
> >> 
> >> use v6;
> >> use Test;
> >> 
> >> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> my $file = "$tmpdir/scratch_file.txt";
> >> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
> >> my $ascii_str =   "ABCDEFGHI";
> >> 
> >> test_read_and_read_again($unichar_str, $file, 3);
> >> test_read_and_read_again($ascii_str,   $file, 0);
> >> 
> >> # write given string to file, then read the third character twice and
> >> # check
> >> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> 
> >>     spurt $file, $str;
> >>     my $fh = $file.IO.open;
> >>     printf "%d: just opened\n", $fh.tell;
> >>     $fh.readchars(2);  # skip a few
> >>     printf "%d: after skipping 2\n", $fh.tell;
> >>     my $chr_1 =      $fh.readchars(1);
> >>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
> >>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >>     my $step_back = $width + $nudge;
> >>     $fh.seek: -$step_back, SeekFromCurrent;
> >>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
> >>     my $chr_2 =      $fh.readchars(1);
> >>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
> >>     is( $chr_1, $chr_2,
> >>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> }
> >> 
> >> 
> >> The output looks like so:
> >> 
> >> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> >> 0: just opened
> >> 9: after skipping 2
> >> 12: after reading 3rd: ䷼
> >> 6: after seeking back 6
> >> 12: after re-reading 3rd: ䷼
> >> ok 1 - read, seek back, and read again gets same char with nudge of 3
> >> 0: just opened
> >> 2: after skipping 2
> >> 3: after reading 3rd: C
> >> 2: after seeking back 1
> >> 3: after re-reading 3rd: C
> >> ok 2 - read, seek back, and read again gets same char with nudge of 0
> >> 
> >> It's really hard to see what I should do if I really wanted to
> >> intermix readchars and seeks like this... I'd need to check the range
> >> of the codepoint to see how far I need to seek to get where I expect
> >> to be.
> >> 
> >> On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
> >> > Thanks, yes I understand unicode and utf-8 reasonably well.
> >> > 
> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> combining codepoint.
> >> >> 
> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> but failing to do that for wider codepoints.
> >> > 
> >> > I think it'd be the other way around... the idea here would be it's
> >> > doing an extra readchar behind the scenes just in case there's
> >> > combining chars involved-- so you're figuring there's some confusion
> >> > about the actual point in the file that's being read and the
> >> > abstraction that readchars is supplying?
> >> > 
> >> > On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
> >> >> In UTF8 characters can be 1 to 4 bytes long.
> >> >> 
> >> >> UTF8 was designed so that 7-bit ASCII is a subset of it.
> >> >> 
> >> >> Any 8bit byte that has its most significant bit set cannot be ASCII.
> >> >> So multi-byte codepoints have the most significant bit set for all of
> >> >> the bytes.
> >> >> The first byte can tell you the number of bytes that follow it.
> >> >> 
> >> >> That is how a single codepoint is stored.
> >> >> 
> >> >> A character can be made of several codepoints.
> >> >> 
> >> >>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> >> >>     "é"
> >> >> 
> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> combining codepoint.
> >> >> 
> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> but failing to do that for wider codepoints.
> >> >> 
> >> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com> wrote:
> >> >>> I thought that doing a readchars on a filehandle, seeking backwards
> >> >>> the width of the char in bytes and then doing another read
> >> >>> would always get the same character.  That works for ascii-range
> >> >>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> >> >>> characters (commonly 3-bytes in utf-8).
> >> >>> 
> >> >>> The question then, is why do I need a $nudge of 3 for wide chars, but
> >> >>> not ascii-range ones?
> >> >>> 
> >> >>> use v6;
> >> >>> use Test;
> >> >>> 
> >> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> >>> my $file = "$tmpdir/scratch_file.txt";
> >> >>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
> >> >>> my $ascii_str =   "ABCDEFGHI";
> >> >>> 
> >> >>> subtest {
> >> >>> 
> >> >>>     my $nudge = 3;
> >> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
> >> >>> 
> >> >>> }, "Wide unicode chars: $unichar_str";
> >> >>> 
> >> >>> subtest {
> >> >>> 
> >> >>>     my $nudge = 0;
> >> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
> >> >>> 
> >> >>> }, "Ascii-range chars: $ascii_str";
> >> >>> 
> >> >>> # write given string to file, then read the third character twice and
> >> >>> # check
> >> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> >>> 
> >> >>>     spurt $file, $str;
> >> >>>     my $fh = $file.IO.open;
> >> >>>     $fh.readchars(2);  # skip a few
> >> >>>     my $chr_1 =      $fh.readchars(1);
> >> >>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >> >>>     my $step_back = $width + $nudge;
> >> >>>     $fh.seek: -$step_back, SeekFromCurrent;
> >> >>>     my $chr_2 =      $fh.readchars(1);
> >> >>>     is( $chr_1, $chr_2,
> >> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> >>> }


I don't think the UTF-16 issue is related. On the topic of readchars: can
someone tell me why the result of that readchars script is unexpected? Maybe I
missed part of the conversation, but if someone can summarize the expected vs.
actual result, that would be great.
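
For anyone putting that summary together, here is a minimal sketch of the
byte offsets one would expect from the UTF-8 encoding alone. It reuses the two
test strings from the quoted script; the helper name show-byte-offsets is made
up for illustration and is not part of the original code. It only does the
arithmetic and does not touch a filehandle, so it says nothing about what
readchars or .tell actually do internally.

```raku
use v6;

# Same test strings as in the quoted script.
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
my $ascii_str   = "ABCDEFGHI";

# Hypothetical helper (not from the original script): for each character,
# print the range of bytes it occupies in the UTF-8 encoding of the string.
sub show-byte-offsets(Str $str) {
    my $offset = 0;
    for $str.comb -> $chr {
        # Width in bytes of this one character when encoded as UTF-8.
        my $width = $chr.encode('UTF-8').bytes;
        printf "%s: bytes %d..%d (width %d)\n",
            $chr, $offset, $offset + $width - 1, $width;
        $offset += $width;
    }
}

show-byte-offsets($unichar_str);
show-byte-offsets($ascii_str);
```

By this arithmetic the third wide character (䷼) occupies bytes 6..8, so .tell
would be expected to report 6 before reading it and 9 after, whereas the
quoted output shows 9 and 12. For the ASCII string the third character sits at
byte 2, which matches the quoted output (2 before, 3 after).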
