This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Regex modifier for support of chunk processing and prefix matching

=head1 VERSION

  Maintainer: Bart Lateur <[EMAIL PROTECTED]>
  Date: 23 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 316
  Version: 1
  Status: Developing

=head1 ABSTRACT

This RFC proposes to add a regex modifier, which allows to reliably 
recognize possible matches that span multiple records. In addition, it 
can recognize if a string is/ends with a string that may possibly be 
the start of a matching string.

=head1 DESCRIPTION

Currently, string processing happens using one of two approaches:

=over 2

=item *

on a line by line (record by record) basis

=item *

the whole string under test is loaded in memory

=back

Sometimes, you are not able to reliably find a way to split up the 
input so that no match could possibly span multiple input records. But 
at the same time, the input file is too big to be loaded in memory in 
one piece. What to do?

If you just split the data into chunks of arbitrary length and process 
them independently, you may have the problem that some matching strings 
aren't recognized, or worse still, that the wrong kind of match is found
instead. 

=head2 Example

You want to extract occurrences of the string 'abcd', or, as a second 
choice, 'bc', using the regex:

    /abcd|bc/
    
You process this in chunks of 1k each. It can happen that an occurrence 
of the string abcd is broken into two parts, between chunks. What will 
happen, is one of the following:

=for html

<pre>

=end

    end of  start of    /abcd/bc/   
    first   second      will allow
    chunk   chunk       you to find:
    ------  --------    ------------
             abcd        abcd
     a       bcd         bc
     ab      cd         (no match)
     abc     d           bc
     abcd                abcd

=for html

</pre>

=end

=head2 Remedy

My solution is to still use the same regex, but with an extra modifier. 
This modifier makes the regex engine abort its search, as soon as the 
end of the string is prematurely reached. In the above example, if "abc"
is found, no attempt would be made to try to match it against "bc".

For this modifier, I chose the letter 'z', for its mnemonic relationship
with "\z", the regex marker for end-of-string.

In addition, pos() is set to the offset of the start of the recognized 
match prefix. In case of a plain succesful match, or of a normal 
not-found termination, pos is undef() on exit.

This serves both as a flag, as pos will only be defined if the search 
has been aborted for this reason, and it allows more optimized
searching, 
because after you have appended the next chunk to the current one, the 
next try will simply start again at the position where the pattern may 
first match, skipping any earlier matches. In the above example, that 
would be at the "a".

Exception-happy people would most likely implement it using exceptions, 
and throw an exception when the end of the buffer is prematurely 
reached. But I'm not like that. Reaching the end of a string is not an 
exception.

=head2 Application

The following code can do the task described, provided that matches are 
rare enough so that the buffer probably won't grow too big:

    my $chunksize = 1024;
    while(read FH, my $buffer, $chunksize) {
        while(/(abcd|bc)/gz) {
            # do something boring with the matched string:
            print "$1\n";
        }
        if(defined pos) {  # end-of-buffer exception
            # append the next chunk to the current one
            read FH, $buffer, $chunksize, length $buffer;
            # retry matching
            redo;
        }
    }

    
=head2 Procedural programming style

As you can see from the example code, the program flow stays very close 
to what people would ordinarily program under normal circumstances.

By contrast, RFC 93 proposes another solution to the same problem, but 
using callbacks. Since the same sub must do one of several things, the 
first thing that needs to be done is to channel different kinds of 
requests to their own handler. As a result, you need a complete rewrite 
from what you'd use in the ordinary case.

I think that a lot of people will find my approach far less
intimidating.

=head2 Match prefix

It can be useful to be able to recognize if a string could possibly be a
prefix for a potential match. For example in an interactive program, 
you want to allow a user to enter a number into an input field, but 
nothing else. After every single keystroke, you can test what he just 
entered against a regex matching the valid format for a number, so that 
C<1234E> can be recognized as a prefix for the regex

    /^\d+\.?\d*(?:E[+-]?\d+)$/
    

I originally had thought of providing a separate, dedicated regex 
modifier, just for the match prefix, but I don't think too many people 
need this that desperately. You can easily build a working application 
with just the '/z' modifier. If you can't, you're in over your head, 
anyway.  ;-)
    
=head1 IMPLEMENTATION

The regex engine must be given the option to abort the search, and 
B<not> do any backtracking, as soon as it attempts to do a lookahead to 
the next character, but it finds the end of the string instead.

In addition, pos() must be patched so that under this modifier, 
it normally is reset to undef(), but when this exception takes place, 
pos is set to the offset of the start of the prefix.

=head1 REFERENCES

RFC 93: Regex: Support for incremental pattern matching


Reply via email to