On Thu, Apr 21, 2011 at 01:42:42PM -0400, Marc Perry wrote:
> Hi,
> 
> I was parsing a collection of HTML files where I wanted to extract a certain
> block from each file, like this:

This is where everyone will tell you to use some dedicated HTML parsing
module.

> > ./script.pl *.html
> 
> my $accumulator;
> my $capture_counter;
> 
> while ( <> ) {
>     if ( /<h1>/.../labelsub/ ) {
>         $accumulator .= $_ unless /labelsub/;
>         if ( /labelsub/ && !$capture_counter ) {
>             print $accumulator;
>             $capture_counter = 1;
>         }
>         else {
>             next;
>         }
>     }
>     else {
>         next;
>     }
> }
> continue { # flush out the variables and clean up
>    if ( eof ) {
>         close ARGV;
>         $accumulator = '';
>         $capture_counter = '';
>     }
> }
> 
> The bit about the $capture_counter is because some of the files have
> multiple blocks of text that could be accumulated, and I only want the first
> block in the file.
> 
> This usually works fine, until I encountered an input file that did not
> contain the string 'labelsub' after the first '<h1>' regex pattern match.
> Then the conditional if test continued to search in the incoming lines in
> the next file (because I am processing a whole batch using the while (<>)
> operator), which it eventually found, and then printed nothing, because at
> the end-of-file of the previous file, the script flushed the contents of the
> accumulator.
> 
> One solution is to just run the same script individually on each file, but I
> was wondering if there was a way to reset the 'state' of the range operator
> pattern match at the end of the physical file (or at any other time for that
> matter)?

No, there isn't (unless you want to get fancy and use a closure or
something) and so you'll need to find some other way to "end" the range.
The obvious other end point is the end of file, and so you can have your
range operator as:

    if ( /<h1>/ ... /labelsub/ || eof ) {

This will ensure that the range operator "ends" by the end of each file,
but you'd need to do extra work because of the logic of the rest of your
program.  So let's see if we can do something about that.
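To make the effect of "|| eof" concrete, here is a minimal sketch, not taken from the original script. The helper name lines_in_range and the "filename: line" output format are assumptions made for illustration. It uses the two-dot form so that eof is also tested on the line that opens the range; with three dots, an <h1> on the very last line of a file would not see the eof test until the next file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every line that falls inside the range,
# prefixed with the name of the file it came from.
sub lines_in_range {
    my @files = @_;
    local @ARGV = @files;    # let <> read the given files in turn
    my @hits;
    while (<>) {
        # "|| eof" closes the range on the last line of each file, so an
        # unterminated block cannot carry over into the next file.
        if ( /<h1>/ .. /labelsub/ || eof ) {
            push @hits, "$ARGV: $_";
        }
    }
    return @hits;
}

print lines_in_range(@ARGV) if @ARGV;
```

Run against a file that has an <h1> but no labelsub, followed by a second file, the lines of the second file that precede its own <h1> should no longer be reported.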

Whilst it doesn't make a difference to the logic, I prefer to jump out
of a loop early if I find it doesn't satisfy the conditions I'm looking
for.  So I think that:

    next unless /<h1>/ .. /labelsub/ || eof;

looks tidier than the if/else conditional.

Then there's your logic to ensure you only count the first block in each
file.  Perl has the little-known ?? counterpart to //, which matches only
once until reset is called.  So making that line:

    next unless ?<h1>? .. /labelsub/ || eof;

allows you to get rid of the $capture_counter variable.  But you'll need
to add a reset to the continue block, so that the ?? can match again in
the next file.
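A small sketch of the match-once behaviour on its own, written with the explicit m?? spelling. The helper name first_h1_only and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the elements of @_ on which m?<h1>? matched.
sub first_h1_only {
    my @matched;
    for my $line (@_) {
        # m?...? succeeds at most once between calls to reset.
        push @matched, $line if $line =~ m?<h1>?;
    }
    reset;    # re-arm the m?...? searches in the current package
    return @matched;
}

print first_h1_only("<h1>one\n", "<p>text\n", "<h1>two\n");
```

Only the first heading line should come back from each call. Without the reset, a second call would find nothing, because the pattern would still be marked as already used.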

Finally, with this change you may as well just print $accumulator in the
continue block too.  So we end up with:

    my $accumulator = '';   # start empty so printing it never warns

    while ( <> ) {
        next unless ?<h1>? .. /labelsub/ || eof;
        $accumulator .= $_ unless /labelsub/;
    }
    continue { # flush out the variables and clean up
        if ( eof ) {
            print $accumulator;
            $accumulator = '';
            reset;
        }
    }

which, I think, does what you are after.

The docs mention that ?? is vaguely deprecated:

    This usage is vaguely deprecated, which means it just might possibly
    be removed in some distant future version of Perl, perhaps somewhere
    around the year 2168.

That doesn't sound too bad, but there was some talk of an earlier
deprecation of the bare ?? syntax, so it might be safer to use m??
instead.
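For reference, here is the same loop sketched with the explicit m?? form. As far as I know, the bare ?PATTERN? spelling started warning in Perl 5.14 and was removed in 5.22, so on a current perl only m?PATTERN? will compile. The wrapper sub first_blocks and its return value (one accumulated string per input file) are assumptions added to make the loop easy to call; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns one string per input file, holding the
# first <h1> .. labelsub block of that file (without the labelsub line).
sub first_blocks {
    my @files = @_;
    local @ARGV = @files;    # let <> read the given files in turn
    my $accumulator = '';
    my @blocks;
    while (<>) {
        # m?<h1>? can open the range only once per file; "|| eof"
        # guarantees the range is closed by the end of the file.
        next unless m?<h1>? .. /labelsub/ || eof;
        $accumulator .= $_ unless /labelsub/;
    }
    continue {
        if (eof) {
            push @blocks, $accumulator;
            $accumulator = '';
            reset;    # let m?<h1>? match again in the next file
        }
    }
    return @blocks;
}

print first_blocks(@ARGV) if @ARGV;
```

Note that a file with an <h1> but no labelsub still contributes everything from the <h1> to the end of the file, since the range is ended by eof rather than by labelsub.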

Interestingly (for me), this is the first time in over 20 years that I
have found a legitimate use for ??, and the associated reset.

-- 
Paul Johnson - p...@pjcj.net
http://www.pjcj.net
