while trying to get rc working with 4-byte utf-8 sequences,
i noticed that rc has a few weak points in its rune handling
that have nothing to do with rune size.  for example, this
script
        ; cat badbq
        #!/bin/rc
        nl='
        '
        ifs=α$nl echo `{echo abαβ}

produces this output
        ; cat /n/sources/contrib/quanstro/src/futharc/badbq |
                /n/sources/plan9/386/bin/rc
        ab �

this is because Xbackq reads and checks its input one byte
at a time.  so the first byte of β's two-byte sequence matches
the first byte of α in the ifs.  we're left with a garbage byte
that was the second byte in β's utf sequence.  rio turns this
into Runeerror.
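
a minimal sketch of the failure, in portable c rather than rc's own source.  split, inifs and seqlen are hypothetical names made up for this illustration, and the code assumes well-formed utf-8 and input held entirely in memory.  with byrune=0 it compares one byte at a time, as described above; with byrune=1 it compares whole sequences.

```c
#include <stddef.h>
#include <string.h>

/* length of a utf-8 sequence judged by its first byte only;
 * assumes well-formed input, which is enough for this sketch */
static size_t
seqlen(unsigned char c)
{
	if(c < 0x80) return 1;
	if((c & 0xe0) == 0xc0) return 2;
	if((c & 0xf0) == 0xe0) return 3;
	if((c & 0xf8) == 0xf0) return 4;
	return 1;
}

/* does the n-byte sequence at p occur as a whole rune in ifs? */
static int
inifs(const char *ifs, const char *p, size_t n)
{
	size_t m;

	for(; *ifs != '\0'; ifs += m){
		m = seqlen((unsigned char)*ifs);
		if(m > strlen(ifs))
			m = strlen(ifs);	/* never step past the terminator */
		if(m == n && memcmp(ifs, p, n) == 0)
			return 1;
	}
	return 0;
}

/* split in on ifs and write the words to out, joined by single spaces.
 * byrune=0 tests single bytes; byrune=1 tests whole runes. */
void
split(const char *in, const char *ifs, int byrune, char *out, size_t outsz)
{
	size_t n, o = 0;
	int inword = 0, sep;

	if(outsz == 0)
		return;
	while(*in != '\0'){
		n = byrune ? seqlen((unsigned char)*in) : 1;
		if(n > strlen(in))
			n = strlen(in);
		if(byrune)
			sep = inifs(ifs, in, n);
		else
			sep = strchr(ifs, *in) != NULL;
		if(sep)
			inword = 0;
		else{
			if(!inword && o > 0 && o + 1 < outsz)
				out[o++] = ' ';	/* word boundary */
			if(o + n < outsz){
				memcpy(out + o, in, n);
				o += n;
			}
			inword = 1;
		}
		in += n;
	}
	out[o] = '\0';
}
```

called with the input abαβ followed by a newline and an ifs of α plus newline, the byte-wise mode leaves "ab" and the lone second byte of β (0xb2), while the rune-wise mode leaves "ab" and an intact β.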

a second problem is in the lexing:
        ; 8.badrune
        #!/bin/rc
        echo hel�;echo 2nd line
        [...]

notice that rc doesn't see the ';' in the echo:
        ; /n/sources/contrib/quanstro/src/futharc/8.badrune | 
                /n/sources/plan9/386/bin/rc
        hel�;echo 2nd line
        [...]

this is because rc assumes good input.  since 0xc0 starts a
two-byte sequence, rc takes the next byte as the second byte of
that sequence without checking it, so the ';' is swallowed.
in fact, xd shows that rc emits bad utf:
        ; /n/sources/contrib/quanstro/src/futharc/8.badrune | 
                /n/sources/plan9/386/bin/rc | xd -c | sed 1q
        0000000   h  e  l c0  ;  e  c  h  o     2  n  d     l  i

both these problems were addressed by adding a rutf()
function to io.c.  rutf keeps enough in the io buffer to
deal with broken utf at any point in the input.  if broken
runes are detected, Runeerror is returned and 1 byte of
input is consumed.  Xbackq was modified to use rutf and
the strstr is now safe, since only complete runes are tested.
lex was also modified to use rutf, and the byte sequence
0xc0 ';' is now interpreted as Runeerror ';'.
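
the decoding rule described above can be sketched as follows.  this is not rutf itself: decode1 is a hypothetical name, the code is portable c rather than plan 9 c, and it works on a buffer already in memory, so the io-buffer bookkeeping that rutf does is left out.  Runeerror is defined locally as 0xFFFD.

```c
#include <stddef.h>

enum { Runeerror = 0xFFFD };

/* decode one rune from the n bytes at p into *r and return the
 * number of bytes consumed.  malformed or truncated input yields
 * Runeerror and consumes exactly one byte. */
size_t
decode1(const unsigned char *p, size_t n, long *r)
{
	size_t i, len;
	long c, min;

	if(n == 0){
		*r = Runeerror;
		return 0;
	}
	c = p[0];
	if(c < 0x80){
		*r = c;
		return 1;
	}
	if((c & 0xe0) == 0xc0){
		len = 2; c &= 0x1f; min = 0x80;
	}else if((c & 0xf0) == 0xe0){
		len = 3; c &= 0x0f; min = 0x800;
	}else if((c & 0xf8) == 0xf0){
		len = 4; c &= 0x07; min = 0x10000;
	}else
		goto bad;		/* stray continuation or invalid lead byte */
	if(len > n)
		goto bad;		/* sequence cut short */
	for(i = 1; i < len; i++){
		if((p[i] & 0xc0) != 0x80)
			goto bad;	/* not a continuation byte */
		c = c<<6 | (p[i] & 0x3f);
	}
	/* reject overlong forms, surrogates and out-of-range values */
	if(c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
		goto bad;
	*r = c;
	return len;
bad:
	*r = Runeerror;
	return 1;
}
```

given the bytes 0xc0 ';', the first call returns Runeerror and consumes one byte, so the second call sees ';' on its own.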

the source is in /n/sources/contrib/quanstro/src/futharc
with a pre-compiled executable, 8.out.

- erik

p.s. old differences from standard rc
1.  support for history file
2.  support for break within loops.

p.p.s. other new differences
1.  
x=(
        1
        2
        )
is now acceptable syntax.
2.  Xbackq uses exponential allocation for good behavior on
really long input
3.  print offending line number on syntax errors.
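
for item 2, exponential allocation means growing the buffer by a constant factor rather than a fixed increment, so that n appended bytes cost a number of reallocations proportional to log n.  a minimal sketch in portable c, not the code in Xbackq; Buf and bufadd are hypothetical names, and the starting size of 64 and the growth factor of 2 are assumptions.

```c
#include <stdlib.h>

typedef struct Buf Buf;
struct Buf {
	char *s;
	size_t len, cap;
};

/* append one byte, doubling the capacity when the buffer is full.
 * returns 0 on success, -1 if the allocation fails. */
int
bufadd(Buf *b, char c)
{
	char *t;
	size_t ncap;

	if(b->len == b->cap){
		ncap = b->cap ? b->cap*2 : 64;
		t = realloc(b->s, ncap);
		if(t == NULL)
			return -1;
		b->s = t;
		b->cap = ncap;
	}
	b->s[b->len++] = c;
	return 0;
}
```

a Buf must start out zeroed, and the caller frees b->s when done.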

