Re: patch to is_cclass for offset beyond end of string

Patrick R. Michaud Wed, 11 May 2005 04:37:49 -0700

On Tue, May 10, 2005 at 10:22:35PM +0200, Jens Rieks wrote:
> On Tuesday 10 May 2005 20:29, Patrick R. Michaud wrote:
> > This is *excellent*.
> >
> > However, now that I look at things, I'm wondering if a slight
> > change to the specification would be in order -- in my original
> > post I said that find_cclass and find_not_cclass would return -1
> > whenever they (didn't find | found) the character of interest --
> > perhaps it would be better for them to return the length of
> > the string instead?
>
> Yes, sounds reasonable.


...well, in looking at it some more it's reasonable until I see
that returning -1 is the way the other find_* ops work.  So,
part of me thinks we should either be consistent with those, or
make the others consistent with the interpretation I gave above, or
rename the find_cclass and find_not_cclass ops to something different
(perhaps "span_cclass" and "span_not_cclass") so as to avoid
confusion, or deprecate the pre-existing find_* ops.

Any suggestions from the peanut gallery about how this should
work?  If I were designing for the long run I'd make the change,
but that's really not my decision to make, and I'll happily live
with any decision made as long as there's some sort of character
class support (and will likely generate+submit a patch for it).
I only bring this up because I'm asking myself what we'll want in
the long run, as opposed to what we might have now.

Some notes about the motivation for the change below...

Pm

-----

Why find_not_cclass is useful, and why it's useful to return the
length of string instead of an error (-1) indicator:

As many of you know, I've written the Perl Grammar Engine, and I'm
also working on another parser.  In parsing expressions there are
many times where I want to be able to skip over a sequence of
characters until I find one that is not in a certain class --
i.e., for grabbing sequences of digits, word characters, skipping
whitespace, etc.

For example, before having find_cclass and find_not_cclass, grabbing
a sequence of digits in Parrot required a loop, as in:

    .local string target       # string to scan
    .local int pos             # current scanning position
    $I0 = pos                  # keep start of position
  digits_loop:
    $I1 = is_digit target, pos
    unless $I1 goto digits_end
    inc pos
    goto digits_loop
  digits_end:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1    # extract sequence of digits

Having find_not_cclass can make this much simpler:

    .local string target       # string to scan
    .local int pos             # current scanning position
    $I0 = pos                  # keep start of position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1    # extract sequence of digits
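
For anyone who wants to try the proposed semantics without a Parrot
build at hand, here is a rough Python model.  It is only a sketch: the
function names mirror the ops, but the Python signature and the use of
a predicate in place of a .CCLASS_* constant are assumptions made for
illustration, and behaviour for a starting offset beyond the end of
the string is not modelled.

```python
# Hypothetical Python model of the proposed op; not Parrot's actual API.

def find_not_cclass(in_class, target, pos):
    """Return the index of the first character at or after pos that is
    NOT in the class, or len(target) if every remaining character is.
    Assumes 0 <= pos <= len(target)."""
    n = len(target)
    while pos < n and in_class(target[pos]):
        pos += 1
    return pos


def grab_digits(target, pos):
    """Return (digits, new_pos) for the run of digits starting at pos."""
    start = pos                                   # keep start of position
    pos = find_not_cclass(str.isdigit, target, pos)
    return target[start:pos], pos                 # no special case needed
```

With this convention grab_digits("123", 0) simply runs to the end of
the string, and the subtraction that yields the run length never sees
a sentinel value.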

Similarly, I often want to skip comments (e.g., '#' until the next
newline):

    .local string target       # string to scan
    .local int pos             # current scanning position
    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
  comment_skipped:
    # ...

Unfortunately, if find_cclass and find_not_cclass return -1 when the
desired character isn't available, I have to check for this special
return value and handle it accordingly:

    .local string target       # string to scan
    .local int pos             # current scanning position
    $I0 = pos                  # keep start of position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    unless pos == -1 goto digit_1
    pos = length target
  digit_1:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1    # extract sequence of digits

and

    .local string target       # string to scan
    .local int pos             # current scanning position
    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
    unless pos == -1 goto comment_skipped
    pos = length target
  comment_skipped:
    # ...
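
The same fix-up can be modelled in Python to make the cost of the -1
convention concrete.  Again this is a sketch under assumptions: the
name find_not_cclass_neg1 is made up for this illustration, and a
predicate stands in for the .CCLASS_* constant.

```python
# Hypothetical Python model of the -1 convention; not Parrot's actual API.

def find_not_cclass_neg1(in_class, target, pos):
    """Return the index of the first character at or after pos that is
    NOT in the class, or -1 if there is no such character."""
    for i in range(pos, len(target)):
        if not in_class(target[i]):
            return i
    return -1


def grab_digits_neg1(target, pos):
    """Return (digits, new_pos) for the run of digits starting at pos."""
    start = pos
    pos = find_not_cclass_neg1(str.isdigit, target, pos)
    if pos == -1:              # the extra check every caller has to repeat
        pos = len(target)
    return target[start:pos], pos
```

Every call site that wants "scan to the end if nothing stops me" has
to carry those two extra lines.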

So, in these particular instances, receiving the length of the
string is much more useful than the -1 value.  Of course, the -1
value is a quick indicator that the desired character class
wasn't found; without it we have to compare the return value
against the string length if we want to know that no such
character was found.  For example, instead of:

    $I1 = find_not_cclass .CCLASS_NUMERIC, $S0, $I0
    if $I1 == -1 goto only_digits_left

with find_not_cclass returning the length of the string it becomes

    $I2 = length $S0
    $I1 = find_not_cclass .CCLASS_NUMERIC, $S0, $I0
    if $I1 == $I2 goto only_digits_left

In the things I'm writing thus far, I'm having to keep the
length of the string in a register anyway to know when the
scanner has reached the end of input.  So, for me, using
the string length to indicate that the desired character
isn't found doesn't cost me anything over the -1 value.
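
To show what I mean by keeping the length around anyway, here is a
small scanner loop modelled in Python.  It is a sketch only: the
helper names mirror the ops but are hypothetical Python functions,
predicates stand in for the .CCLASS_* constants, and scan_numbers is
invented for this example.

```python
# Hypothetical Python model of a scanner built on the proposed ops.

def find_cclass(in_class, target, pos):
    """Index of the first character at or after pos that IS in the
    class, or len(target) if there is none."""
    n = len(target)
    while pos < n and not in_class(target[pos]):
        pos += 1
    return pos


def find_not_cclass(in_class, target, pos):
    """Index of the first character at or after pos that is NOT in the
    class, or len(target) if there is none."""
    n = len(target)
    while pos < n and in_class(target[pos]):
        pos += 1
    return pos


def is_newline(ch):
    return ch == "\n"


def scan_numbers(target):
    """Collect runs of digits, skipping '#' comments and anything else."""
    length = len(target)       # needed for the end-of-input test anyway
    pos = 0
    numbers = []
    while pos < length:
        ch = target[pos]
        if ch == "#":
            # skip the comment; lands on the newline or at end of input
            pos = find_cclass(is_newline, target, pos + 1)
        elif ch.isdigit():
            start = pos
            pos = find_not_cclass(str.isdigit, target, pos)
            numbers.append(target[start:pos])
        else:
            pos += 1           # whitespace or other character: skip it
    return numbers
```

Because both helpers return the string length when nothing is found,
the single "pos < length" test at the top of the loop also covers a
comment or a digit run that extends to the end of the input.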

What's best for the long term?
