On Tue, May 10, 2005 at 10:22:35PM +0200, Jens Rieks wrote:
> On Tuesday 10 May 2005 20:29, Patrick R. Michaud wrote:
> > This is *excellent*.
> >
> > However, now that I look at things, I'm wondering if a slight
> > change to the specification would be in order -- in my original
> > post I said that find_cclass and find_not_cclass would return -1
> > whenever they (didn't find | found) the character of interest --
> > perhaps it would be better for them to return the length of
> > the string instead?
>
> Yes, sounds reasonable.

...well, in looking at it some more it's reasonable until I see that
returning -1 is the way the other find_* ops work.  So, part of me
thinks we should either be consistent with those, or make the others
consistent with the interpretation I gave above, or rename the
find_cclass and find_not_cclass ops to something different (perhaps
"span_cclass" and "span_not_cclass") so as to avoid confusion, or
deprecate the pre-existing find_* ops.

Any suggestions from the peanut gallery about how this should work?
If I were designing for the long run I'd make the change, but that's
really not my decision to make, and I'll happily live with any
decision made as long as there's some sort of character class support
(and will likely generate+submit a patch for it).  I only bring this
up because I'm asking myself what we'll want in the long run, as
opposed to what we might have now.

Some notes about the motivation for the change below...

Pm

-----

Why find_not_cclass is useful, and why it's useful to return the
length of string instead of an error (-1) indicator:

As many of you know, I've written the Perl Grammar Engine, and I'm
also working on another parser.  In parsing expressions there are
many times where I want to be able to skip over a sequence of
characters until I find one that is not in a certain class -- i.e.,
for grabbing sequences of digits, word characters, skipping
whitespace, etc.

For example, before having find_cclass and find_not_cclass, grabbing
a sequence of digits in Parrot required a loop, as in:

    .local string target            # string to scan
    .local int pos                  # current scanning position

    $I0 = pos                       # keep start of position
  digits_loop:
    $I1 = is_digit target, pos
    unless $I1 goto digits_end
    inc pos
    goto digits_loop
  digits_end:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1   # extract sequence of digits

Having find_not_cclass can make this much simpler:

    .local string target            # string to scan
    .local int pos                  # current scanning position

    $I0 = pos                       # keep start of position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1   # extract sequence of digits

Similarly, I often want to skip comments (e.g., '#' until the next
newline):

    .local string target            # string to scan
    .local int pos                  # current scanning position

    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
  comment_skipped:
    # ...

Unfortunately, if find_cclass and find_not_cclass return -1 when the
desired character isn't available, I have to check for this special
return value and handle it accordingly:

    .local string target            # string to scan
    .local int pos                  # current scanning position

    $I0 = pos                       # keep start of position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    unless pos == -1 goto digit_1
    pos = length target
  digit_1:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1   # extract sequence of digits

and

    .local string target            # string to scan
    .local int pos                  # current scanning position

    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
    unless pos == -1 goto comment_skipped
    pos = length target
  comment_skipped:
    # ...

So, in these particular instances, receiving the length of the string
is much more useful than the -1 value.

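To make the trade-off concrete, here is a small Python model of the
two return conventions.  It is only a sketch under stated assumptions:
the function names, the set-based character class, and the two
grab_digits helpers are all hypothetical and are not Parrot ops;
Python is used purely so the comparison can be run without a Parrot
build.

```python
# Illustrative model only: these are NOT Parrot ops, just stand-ins
# for the two proposed behaviours of find_not_cclass.
DIGITS = frozenset("0123456789")   # assumed stand-in for a numeric class


def find_not_cclass_minus1(cclass, target, pos):
    """Index of the first char at/after pos not in cclass, else -1."""
    for i in range(pos, len(target)):
        if target[i] not in cclass:
            return i
    return -1


def find_not_cclass_len(cclass, target, pos):
    """Same scan, but return len(target) when nothing is found."""
    for i in range(pos, len(target)):
        if target[i] not in cclass:
            return i
    return len(target)


def grab_digits_minus1(target, pos):
    # The -1 convention needs an explicit fix-up before slicing.
    end = find_not_cclass_minus1(DIGITS, target, pos)
    if end == -1:
        end = len(target)
    return target[pos:end]


def grab_digits_len(target, pos):
    # The length convention can be used as the end offset directly.
    end = find_not_cclass_len(DIGITS, target, pos)
    return target[pos:end]
```

Both helpers extract the same run of digits; the only difference is
whether the caller has to special-case a run that extends to the end
of the string.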
Of course, the -1 value is a quick indicator that the desired
character class wasn't found; without it we have to check the return
value against the string length if we want to know if the character
position wasn't located.  For example, instead of:

    $I1 = find_not_cclass .CCLASS_DIGIT, $S0, $I0
    if $I1 == -1 goto only_digits_left

with find_not_cclass returning the length of the string it becomes

    $I2 = length $S0
    $I1 = find_not_cclass .CCLASS_DIGIT, $S0, $I0
    if $I1 == $I2 goto only_digits_left

In the things I'm writing thus far, I'm having to keep the length of
the string in a register anyway to know when the scanner has reached
the end of input.  So, for me, using the string length to indicate
that the desired character isn't found doesn't cost me anything over
the -1 value.

What's best for the long term?