This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE Regex modifier for support of chunk processing and prefix matching =head1 VERSION Maintainer: Bart Lateur <[EMAIL PROTECTED]> Date: 23 Sep 2000 Mailing List: [EMAIL PROTECTED] Number: 316 Version: 1 Status: Developing =head1 ABSTRACT This RFC proposes to add a regex modifier, which allows to reliably recognize possible matches that span multiple records. In addition, it can recognize if a string is/ends with a string that may possibly be the start of a matching string. =head1 DESCRIPTION Currently, string processing happens using one of two approaches: =over 2 =item * on a line by line (record by record) basis =item * the whole string under test is loaded in memory =back Sometimes, you are not able to reliably find a way to split up the input so that no match could possibly span multiple input records. But at the same time, the input file is too big to be loaded in memory in one piece. What to do? If you just split the data into chunks of arbitrary length and process them independently, you may have the problem that some matching strings aren't recognized, or worse still, that the wrong kind of match is found instead. =head2 Example You want to extract occurrences of the string 'abcd', or, as a second choice, 'bc', using the regex: /abcd|bc/ You process this in chunks of 1k each. It can happen that an occurrence of the string abcd is broken into two parts, between chunks. What will happen, is one of the following: =for html <pre> =end end of start of /abcd/bc/ first second will allow chunk chunk you to find: ------ -------- ------------ abcd abcd a bcd bc ab cd (no match) abc d bc abcd abcd =for html </pre> =end =head2 Remedy My solution is to still use the same regex, but with an extra modifier. This modifier makes the regex engine abort its search, as soon as the end of the string is prematurely reached. In the above example, if "abc" is found, no attempt would be made to try to match it against "bc". For this modifier, I chose the letter 'z', for its mnemonic relationship with "\z", the regex marker for end-of-string. In addition, pos() is set to the offset of the start of the recognized match prefix. In case of a plain succesful match, or of a normal not-found termination, pos is undef() on exit. This serves both as a flag, as pos will only be defined if the search has been aborted for this reason, and it allows more optimized searching, because after you have appended the next chunk to the current one, the next try will simply start again at the position where the pattern may first match, skipping any earlier matches. In the above example, that would be at the "a". Exception-happy people would most likely implement it using exceptions, and throw an exception when the end of the buffer is prematurely reached. But I'm not like that. Reaching the end of a string is not an exception. =head2 Application The following code can do the task described, provided that matches are rare enough so that the buffer probably won't grow too big: my $chunksize = 1024; while(read FH, my $buffer, $chunksize) { while(/(abcd|bc)/gz) { # do something boring with the matched string: print "$1\n"; } if(defined pos) { # end-of-buffer exception # append the next chunk to the current one read FH, $buffer, $chunksize, length $buffer; # retry matching redo; } } =head2 Procedural programming style As you can see from the example code, the program flow stays very close to what people would ordinarily program under normal circumstances. By contrast, RFC 93 proposes another solution to the same problem, but using callbacks. Since the same sub must do one of several things, the first thing that needs to be done is to channel different kinds of requests to their own handler. As a result, you need a complete rewrite from what you'd use in the ordinary case. I think that a lot of people will find my approach far less intimidating. =head2 Match prefix It can be useful to be able to recognize if a string could possibly be a prefix for a potential match. For example in an interactive program, you want to allow a user to enter a number into an input field, but nothing else. After every single keystroke, you can test what he just entered against a regex matching the valid format for a number, so that C<1234E> can be recognized as a prefix for the regex /^\d+\.?\d*(?:E[+-]?\d+)$/ I originally had thought of providing a separate, dedicated regex modifier, just for the match prefix, but I don't think too many people need this that desperately. You can easily build a working application with just the '/z' modifier. If you can't, you're in over your head, anyway. ;-) =head1 IMPLEMENTATION The regex engine must be given the option to abort the search, and B<not> do any backtracking, as soon as it attempts to do a lookahead to the next character, but it finds the end of the string instead. In addition, pos() must be patched so that under this modifier, it normally is reset to undef(), but when this exception takes place, pos is set to the offset of the start of the prefix. =head1 REFERENCES RFC 93: Regex: Support for incremental pattern matching