On Sat, 30 Sep 2000 00:23:13 +0100, Hugo wrote:

>This is a strength of RFC 93 however, since in that context we
>don't need to restart the match each time we go off to fetch more
>data. In that situation if we run out of data after the 1234E2+2
>we fail the attempt to widen the \d+, match forward to the $, and
>are immediately finished.

Yes, but RFC 93 has some other disadvantages. Look at the template of
the sub we need for every callback function:

    sub s {
        if ($_[1]) {                  # "putback unused data" request
            recache($_[0]);
        } else {                      # "send more data" request
            return get_chars(max => $_[0]);
        }
    }

This is not pretty, especially since recache() is not even defined yet.

Furthermore, where is the resulting buffer stored? People usually still
want a copy of their data, to do yet other things with it. Here, the
data has disappeared into thin air. The only way to get at it is to put
capturing parens in the regex. As a consequence, the regex shouldn't
read any more characters than it actually eats. So the reading and
pushing back of the data will pretty much have to happen per byte.
That's what RFC 93 says, too:

>The single
>argument would specify how many characters should be returned (typically
>this would be 1, unless internal analysis by the regex engine can deduce
>that more than one character will be required)

Imagine that you have a data file of 1 Mb that has to be processed.
That is a minimum: it hardly makes sense to process much smaller files
in chunks, because they will likely fit in memory as a whole anyway.
Imagine that, typically, your regex needs to process and backtrack over
each character five times. That is 5 reads and 4 pushbacks, which I
think is a rather conservative estimate for complex regexes. That means
9 invocations of this sub *per character*, or 9 million callback
function calls for your 1 Mb data file. I don't even want to start
thinking about the effect this will have on the processing time
required. The idea would probably be OK if this was C, but it is not.

Now imagine how my mechanism would do it. First of all, the getting and
storing of the data all happen manually, so you have it at your
disposal for whatever else you'd like to use it for. Let's make it
small chunks of 1k. A 1 Mb file will then be processed in (roughly)
1000 chunks. Add the need for redoing the regex without the '/z'
modifier at the end of the file, and that makes a total of 1001.
Compare that to the 9 million callback calls of RFC 93.

Look, I don't think that these two approaches really exclude one
another. There's no conflict. It is possible to implement both.

And finally: I'm not married to the interface. That might change
completely. All suggestions welcome. But I do like the cheap way of
making the regex tell me that it needs more data before it can make up
its mind for 100%. Modifying a script that was written to process data
in lines, so that it can now work with multiline data (multiline CSV
files, HTML files with tags split over several lines, ...) really
requires only a relatively small change to your script. *That* is one
of the features I really like.

Compared to that, RFC 93 feels like a straitjacket. To me. You may have
to completely rewrite your script. So much for code reuse.

-- Bart.
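
P.S. To make the chunk counting a little more concrete, here is roughly
the loop I have in mind, written in today's Perl, since the '/z'
modifier obviously doesn't exist yet: the "regex wants more data"
signal is faked by letting an incomplete record simply fail to match,
and the file name, the (simplified) CSV record format and
handle_record() are all made up for the sake of the example.

    use strict;

    my $chunk = 1024;                 # 1k chunks, as in the count above
    my $buf   = '';
    my $file  = 'data.csv';           # made-up example file

    open my $fh, '<', $file or die "can't open $file: $!";

    while (read $fh, my $more, $chunk) {
        $buf .= $more;

        # Eat as many *complete* records as the buffer holds right now.
        # A record is one CSV line, but quoted fields may contain
        # embedded newlines; a record whose closing quote or newline
        # hasn't arrived yet simply fails to match -- that is the
        # "I need more data" signal that /z would give us for free.
        while ($buf =~ s/\A((?:"[^"]*"|[^"\n])*)\n//) {
            handle_record($1);
        }
    }

    # End of file: one last pass over what is left, this time knowing
    # the data is complete (the pass "without /z"), so a record with no
    # final newline is acceptable now.
    handle_record($buf) if length $buf;

    close $fh;

    sub handle_record { print "record: $_[0]\n" }

The point is the shape of the loop: one read per 1k chunk, plus one
final pass when the file is exhausted -- 1001 steps, not 9 million.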