On Sat, 30 Sep 2000 00:23:13 +0100, Hugo wrote:

>This is a strength of RFC 93 however, since in that context we
>don't need to restart the match each time we go off to fetch more
>data. In that situation if we run out of data after the 1234E2+2
>we fail the attempt to widen the \d+, match forward to the $, and
>are immediately finished.

Yes, but RFC 93 has some other disadvantages.

Look at the template of the sub we need for every callback function:

        sub s {
            if ($_[1]) {                # "putback unused data" request
                recache($_[0]);
            }
            else {                      # "send more data" request
                return get_chars(max => $_[0]);
            }
        }

This is not pretty, especially since recache() is not even defined yet.
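
Just to make that concrete, here is one possible way get_chars() and
recache() could be filled in, with a plain filehandle and a pushback
string. The details are entirely my own guesses; RFC 93 doesn't specify
any of this:

        open my $fh, '<', 'data.txt' or die "open: $!";
        my $pending = '';               # characters the engine pushed back

        sub get_chars {
            my %arg  = @_;              # called as get_chars(max => $n)
            my $want = $arg{max} || 1;
            # serve pushed-back characters first (a real version would
            # top this up from the file if there aren't enough of them)
            return substr($pending, 0, $want, '') if length $pending;
            my $buf = '';
            read $fh, $buf, $want;      # may return fewer characters at EOF
            return $buf;
        }

        sub recache {
            $pending = $_[0] . $pending;    # unused data goes back in front
        }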

Furthermore, where is the resulting buffer stored? People usually still
want a copy of their data, to do other things with it as well. Here, the
data has disappeared into thin air. The only way to get at it is to put
capturing parens in the regex.

As a consequence, the regex shouldn't read any more characters than it
actually eats. So, reading and pushing back of the data will pretty much
have to happen one byte at a time. That's what RFC 93 says, too:

>The single
>argument would specify how many characters should be returned (typically
>this would be 1, unless internal analysis by the regex engine can deduce
>that more than one character will be required)

Imagine that you have a data file of 1 MB that has to be processed. That
is a minimum: it hardly makes sense to process much smaller files in
chunks, because such a file will likely just fit in memory as a whole.
Imagine that, typically, your regex needs to process and backtrack over
each character five times. That is 5 reads and 4 pushbacks, which I
think is a rather conservative estimate for complex regexes. That means
9 invocations of this sub *per character*, or 9 million callback calls
for your 1 MB data file. I don't even want to start thinking about the
effect this will have on the processing time required. The idea would
probably be OK if this were C, but it is not.
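
If you want a feel for the sub call overhead alone, the standard
Benchmark module gives a quick estimate. The do-nothing callback below
obviously understates the real cost, since it doesn't read or push back
anything:

        use Benchmark qw(timethis);

        sub callback { 1 }                  # stands in for the real sub
        timethis(9_000_000, sub { callback(1) });  # one "give me 1 char" per call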

Imagine how my mechanism would do it. First of all, the getting and
storing of data all happen manually, so you have the data at your
disposal for whatever else you'd like to use it for. Let's make it small
chunks of 1k. A 1 MB file will then be processed in (roughly) 1000
chunks. Add the need for redoing the regex without the '/z' modifier at
the end of the file, and that makes a total of 1001. Compare that to the
9 million callback calls of RFC 93.
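
Roughly, the loop I have in mind looks like the sketch below. Mind you,
the '/z' modifier and the exact way the engine says "feed me more data"
are part of the proposal and don't exist in current Perl, so this won't
actually run today; $pattern and process() are placeholders too.

        open my $fh, '<', 'data.txt' or die "open: $!";
        my $buf = '';

        while (read $fh, my $chunk, 1024) {     # ~1000 reads for a 1 MB file
            $buf .= $chunk;
            if ($buf =~ /$pattern/z) {          # /z: more data may follow
                process($&);
                $buf = substr($buf, $+[0]);     # keep only the unmatched tail
            }
            # if the engine instead says "need more data", we simply fall
            # through and append the next chunk
        }
        # end of file: redo the match once, this time without /z
        if ($buf =~ /$pattern/) {
            process($&);
        }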

Look, I don't think that these two approaches really exclude one
another. There's no conflict. It is possible to implement both.

And finally: I'm not married to the interface. That might change
completely. All suggestions welcome. But I like the cheap way of making
the regex tell me that it needs more data before it can make up its mind
with 100% certainty.

Modifying a script that was written to process data in lines, so that it
can now work with multiline data (multiline CSV files, HTML files with
tags split over several lines, ...), really requires only a relatively
small change to your script. *That* is one of the features I really like.
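
For instance, a line-by-line loop and its chunked counterpart might only
differ in the outer reading loop, something like this (again with the
proposed '/z' modifier, and with $record and handle_record() standing in
for your real record pattern and processing code):

        open my $fh, '<', 'data.txt' or die "open: $!";

        # before: one record per line
        while (my $line = <$fh>) {
            handle_record($line);
        }

        # after: records may span lines; only the reading loop changes
        my $buf = '';
        while (read $fh, my $chunk, 1024) {
            $buf .= $chunk;
            while ($buf =~ s/^$record//z) {     # eat records as they complete
                handle_record($&);
            }
        }
        while ($buf =~ s/^$record//) {          # end of file: no /z any more
            handle_record($&);
        }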

Compared to that, RFC 93 feels like a straitjacket. To me. You may
have to completely rewrite your script. So much for code reuse.

-- 
        Bart.
