From: "John W. Krahn" <[EMAIL PROTECTED]>
> "Perl.Org" wrote:
> > 
> > Can anyone share a script that recurses a filesystem for files
> > containing one or more patterns?  Seems like it would be easy to
> > write but if it's already out there...
> 
> This will probably work:
> 
> #!/usr/bin/perl
> use warnings;
> use strict;
> use File::Find;
> 
> my $dir = shift || '.';
> 
> $/ = \2_048;  # set buffer size to 2,048 bytes, YMMV

Wow
 
> find( sub {
>     local @ARGV = $File::Find::name;

Wow

>     my $last = '';
>     while ( <> ) {
>         $_ = $last . $_;
>         if ( /pattern1/ or /pattern2/ or /pattern3/ ) {
>             print "$ARGV\n";
>             close ARGV;
>             return;
>             }
>         $last = $_;
>         }
>     }, $dir );

You are trying to scare everyone away, aren't you? ;-)
Besides, changing $/ globally is not the best thing to do. It will work 
in a tiny script, but once you try to use this code in something 
bigger (or do something more complex with the found files) you are 
bound to run into problems.
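
If you like the fixed-size-record trick anyway, one way to keep it
from leaking is to localize $/ in the block that does the reading.
A minimal sketch; the helper name first_record and the in-memory
filehandle are made up for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: read one fixed-size record without changing $/
# for the rest of the program.
sub first_record {
    my ($fh, $size) = @_;
    local $/ = \$size;      # fixed-size records; restored when the sub returns
    return scalar <$fh>;
}

my $data = "line one\nline two\n";
open my $in, '<', \$data or die "Can't open in-memory file: $!";

my $record = first_record($in, 4);   # reads exactly 4 bytes
my $rest   = <$in>;                  # $/ is "\n" again, so this reads to end of line
close $in;

print "record: $record\n";
print "rest:   $rest";
```

Outside the sub, <> and friends keep reading line by line as usual.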

I think this would be both safer and more readable:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;

my $dir = shift || '.';

find( sub {
    my $FH;   # declared outside the unless so it stays in scope below
    unless (open $FH, '<', $_) {
        print STDERR "Can't open $File::Find::name : $!\n";
        return;
    }
    my ($chunk, $last) = ('','');
    while ( read $FH, $chunk, 2048 ) {
        $chunk = $last . $chunk;
        if ( $chunk=~/pattern1/ or $chunk=~/pattern2/ or $chunk=~/pattern3/ ) {
            print "$File::Find::name\n";
            close $FH;
            return;
        }
        $last = $chunk;
    }
    close $FH;
}, $dir );

__END__

Actually there might be a problem with that code. After the first 
iteration $last contains 2048 characters, after the second 4096, ... 
it keeps growing! So if the file is huge and it doesn't contain any 
of the patterns you'll end up with the whole file in memory. Twice.
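
One way to keep the memory bounded would be to carry over only the
tail of the previous buffer. This is just a sketch under my own
assumptions: the helper name chunked_match and the $overlap parameter
are invented here, and it only helps if no match is longer than
$overlap + 1 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a filehandle in fixed-size chunks, keeping only
# the last $overlap bytes so a match straddling two chunks is still seen.
sub chunked_match {
    my ($fh, $re, $size, $overlap) = @_;
    my ($chunk, $tail) = ('', '');
    while ( read $fh, $chunk, $size ) {
        my $buffer = $tail . $chunk;
        return 1 if $buffer =~ $re;
        # keep at most $overlap bytes for the next round
        $tail = length($buffer) > $overlap
              ? substr( $buffer, length($buffer) - $overlap )
              : $buffer;
    }
    return 0;
}

my $data = 'xxxxxneedlexxxxx';    # "needle" straddles the 8-byte boundary

open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $found = chunked_match( $fh, qr/needle/, 8, 5 );
close $fh;

open $fh, '<', \$data or die "Can't open in-memory file: $!";
my $missed = chunked_match( $fh, qr/needle/, 8, 0 );   # no overlap at all
close $fh;

print "with overlap: $found, without: $missed\n";
```

The buffer never holds more than $size + $overlap bytes, whatever the
size of the file.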

If you do want to search through the whole file (but start with the 
beginning first) you might do something like this:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;

my $dir = shift || '.';

find( sub {
    my $FH;   # declared outside the unless so it stays in scope below
    unless (open $FH, '<', $_) {
        print STDERR "Can't open $File::Find::name : $!\n";
        return;
    }
    my ($chunk, $pos) = ('',0);
    # the fourth argument makes read() append at offset $pos
    while ( read $FH, $chunk, 2048, $pos ) {
        $pos+=2048;
        if ( $chunk=~/pattern1/ or $chunk=~/pattern2/ or $chunk=~/pattern3/ ) {
            print "$File::Find::name\n";
            close $FH;
            return;
        }
    }
    close $FH;
}, $dir );

__END__

This way we only have the file in memory once and we do not copy it 
between two variables.

There is still a problem with the code: it is possible to get some 
false positives.

Assume one of the patterns ends with a $. That is, it is supposed to 
match at the end of a line. But the chunks do not have to end at the 
end of a line, they may end anywhere. And since $ normally means either 
end of line or end of string, the pattern may match at the end of a 
chunk instead of at the end of a line/file. Another possible cause of 
problems is \b. If the pattern ends with \b it may also match 
incorrectly at the end of a chunk even if the chunk ends in midword.
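
A tiny illustration of both cases, with a made-up string cut in the
middle of a word. It also shows that $+[0] (the offset just past the
last successful match) tells us when a match runs up to the end of
the chunk:

```perl
use strict;
use warnings;

my $text  = "some foobar here";
my $chunk = substr $text, 0, 8;   # "some foo", cut in the middle of a word

# Neither pattern matches the full text ...
my $dollar_in_text = $text =~ /foo$/  ? 1 : 0;
my $wordb_in_text  = $text =~ /foo\b/ ? 1 : 0;

# ... but both match the truncated chunk.
my $wordb_in_chunk  = $chunk =~ /foo\b/ ? 1 : 0;
my $dollar_in_chunk = $chunk =~ /foo$/  ? 1 : 0;

# Does the last successful match end exactly at the end of the chunk?
my $touches_end = ( $dollar_in_chunk and $+[0] == length $chunk ) ? 1 : 0;

print "full text: \$ $dollar_in_text, \\b $wordb_in_text\n";
print "chunk:     \$ $dollar_in_chunk, \\b $wordb_in_chunk\n";
print "match touches end of chunk: $touches_end\n";
```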

To fix this we need something like this:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;

my $dir = shift || '.';

find( sub {
    my $FH;   # declared outside the unless so it stays in scope below
    unless (open $FH, '<', $_) {
        print STDERR "Can't open $File::Find::name : $!\n";
        return;
    }
    my ($chunk, $pos, $last_match) = ('', 0, 0);
    while ( read $FH, $chunk, 2048, $pos ) {
        $pos+=2048;
        if ( $chunk=~/pattern1/ or $chunk=~/pattern2/ or $chunk=~/pattern3/ ) {
            # $+[0] is the offset just past the match
            if ($+[0] == length($chunk)) { # matched at the end of chunk
                $last_match = 1;
            } else {
                print "$File::Find::name\n";
                close $FH;
                return;
            }
        } else {
            $last_match = 0;
        }
    }
    close $FH;
    print "$File::Find::name\n" if $last_match; 
                # in case the match at the end of chunk was also at the end of file
}, $dir );

__END__

I think you'd only get a false positive from this if you used look-
aheads. In that case the script would not notice that the match was 
near the end of a chunk and that the look-ahead matched only thanks 
to the end of the chunk.
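
For example (pattern and text invented for the illustration), a
negative look-ahead can succeed only because the text it would have
rejected is cut off, while the match itself ends well before the end
of the chunk:

```perl
use strict;
use warnings;

my $re    = qr/foo(?!\.bar)/;     # "foo" not followed by ".bar"
my $text  = "see foo.bar";
my $chunk = substr $text, 0, 9;   # "see foo.b", the look-ahead text is cut off

my $in_text   = $text  =~ $re ? 1 : 0;   # rejected: ".bar" follows
my $in_chunk  = $chunk =~ $re ? 1 : 0;   # accepted: only ".b" is visible
my $match_end = $in_chunk ? $+[0] : -1;  # offset just past the match

print "full text: $in_text, chunk: $in_chunk\n";
print "match ends at $match_end, chunk length is ", length($chunk), "\n";
```

The match ends two bytes before the end of the chunk, so a check that
only looks at where the match ends will not catch it.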



If the intent was to search two chunks at once, to make sure we do not 
miss a pattern that straddles the boundary between two chunks, we 
could use something like this:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;

my $dir = shift || '.';

find( sub {
    my $FH;   # declared outside the unless so it stays in scope below
    unless (open $FH, '<', $_) {
        print STDERR "Can't open $File::Find::name : $!\n";
        return;
    }
    my ($chunk, $last_match) = ('', 0);
    read( $FH, $chunk, 2048 ) or return;

    if ( $chunk=~/pattern1/ or $chunk=~/pattern2/ or $chunk=~/pattern3/ ) {
        # $+[0] is the offset just past the match
        if ($+[0] == length($chunk)) { # matched at the end of chunk
            $last_match = 1;
        } else {
            print "$File::Find::name\n";
            close $FH;
            return;
        }
    }

    # append the next chunk after the previous one, search both together
    while ( read $FH, $chunk, 2048, 2048 ) {
        if ( $chunk=~/pattern1/ or $chunk=~/pattern2/ or $chunk=~/pattern3/ ) {
            if ($+[0] == length($chunk)) { # matched at the end of chunk
                $last_match = 1;
            } elsif ($-[0] == 0) { # matched at the start of chunk
                # this is a false match. If it was real 
                # it'd match in the previous iteration.
                $last_match = 0;
            } else {
                print "$File::Find::name\n";
                close $FH;
                return;
            }
        } else {
            $last_match = 0;
        }
        $chunk = substr( $chunk, 2048, 2048);   # keep only the newer chunk
    }
    close $FH;
    print "$File::Find::name\n" if $last_match; 
                # in case the match at the end of chunk was also at the end of file
}, $dir );

__END__

Of course this would miss a match that would be longer than two 
chunks and could miss one that's longer than one chunk.


There is also the easiest case: if the patterns do not match newlines 
we can read the file line by line, instead of in chunks:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;

my $dir = shift || '.';

find( sub {
    my $FH;   # declared outside the unless so it stays in scope below
    unless (open $FH, '<', $_) {
        print STDERR "Can't open $File::Find::name : $!\n";
        return;
    }
    while ( <$FH> ) {
        chomp;
        if ( /pattern1/ or /pattern2/ or /pattern3/ ) {
            print "$File::Find::name\n";
            close $FH;
            return;
        }
    }
    close $FH;
}, $dir );

__END__
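
The condition about newlines matters for the line-by-line version. A
small made-up case: a pattern that spans two lines matches the data
as a whole, but never matches any single line.

```perl
use strict;
use warnings;

my $data = "foo\nbar\n";
my $re   = qr/foo\nbar/;          # spans a line boundary

my $whole = $data =~ $re ? 1 : 0;

# Same data, read one line at a time through an in-memory filehandle.
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $by_line = 0;
while ( my $line = <$fh> ) {
    if ( $line =~ $re ) { $by_line = 1; last }
}
close $fh;

print "whole: $whole, line by line: $by_line\n";
```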


Humpf!

Jenda
P.S.: All code is untested!
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
        -- Terry Pratchett in Sourcery

