Re: Comparing files with regular expressions

Chas. Owens Fri, 02 May 2008 04:42:14 -0700

On Thu, May 1, 2008 at 4:09 PM, rubinsta <[EMAIL PROTECTED]> wrote:
> Hello,
>
>  I'm a Perl uber-novice and I'm trying to compare two files in order to
>  exclude items listed on one file from the complete list on the other
>  file.  What I have so far prints out a third file listing everything
>  that matches the exclude file from the complete file (which I'm hoping
>  will be a duplicate of the exclude file) just so I can make sure that
>  the comparison script is working.  The files are lists of numbers
>  separated by newlines.  The exclude file has 333 numbers and the
>  complete file has 9000 numbers.
>
>  Here's what I have so far:
>
>  #!/usr/bin/perl
>  use strict;
>  use warnings;
>
>  open(ALL, "all.txt") or die $!;
>  open(EX, "exclude.txt") or die $!;
>  open(OUT,'>exTest.txt') or die $!;
snip


Use the three argument version of open and lexical filehandles:

open my $ex, "<", "exclude.txt"
    or die "could not open exclude.txt: $!";

snip
>
>  my @ex_lines = <EX>;
>  my @all_lines = <ALL>;
snip

Reading a filehandle in list context is a bad idea because it pulls
the whole file into memory at once.  It may work now when the files
are small, but data almost always grows.  Unless you are certain that
the file will remain small you should not do this.  Use a while loop
instead.
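
A minimal sketch of the while-loop form, assuming nothing beyond core
Perl.  The in-memory string $data is only a stand-in so the example
runs without external files; in a real script you would open the
complete file instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a real file so the sketch runs on its own;
# normally you would open "all.txt" here.
my $data = "1\n2\n10\n20\n";
open my $all, "<", \$data
    or die "could not open in-memory file: $!";

my $count = 0;
# Scalar context: one line is read per iteration, so memory
# use stays flat however large the file grows.
while (my $line = <$all>) {
    chomp $line;
    $count++;
    print "line $count: $line\n";
}
```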

snip
>
>  foreach $all (@all_lines){
>    foreach $ex (@ex_lines){
>        if ($ex =~ /(^$all)/){

This is testing to see if there are any lines in the exclude file that
start with what was in the complete file.  That is, if the complete
file was

1
2

and the exclude file was

10
20

then all lines would be excluded.  Is this really what you want?
Also, because you have not surrounded $all with \Q and \E (like
/^\Q$all\E/), any metacharacters in $all (like *, ., ?, etc.) will be
treated as metacharacters instead of normal characters.  Unless the
lines in complete are known to be regexes this could be bad.  And by
bad I mean everything from mismatches to the dreaded "(?{system qq(rm
-rf $ENV{HOME})})" (which Perl refuses to run from an interpolated
variable unless use re 'eval' is in effect, but do not count on that
being off everywhere).
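
To make the difference concrete, here is a small sketch of the same
match with and without \Q...\E.  The value "1.5" is a made-up line
chosen only because it contains a metacharacter.

```perl
use strict;
use warnings;

my $all = "1.5";   # hypothetical line containing a metacharacter

# Unquoted: "." matches any character, so "1x5" matches too.
my $loose  = "1x5" =~ /^$all/     ? 1 : 0;
# Quoted: "." must be a literal dot, so "1x5" no longer matches.
my $quoted = "1x5" =~ /^\Q$all\E/ ? 1 : 0;
# Quoted pattern still matches the literal text.
my $exact  = "1.5" =~ /^\Q$all\E/ ? 1 : 0;

print "unquoted: $loose, quoted: $quoted, literal: $exact\n";
```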

If you don't have regexes in the complete file but do want to check
for its entries as prefixes in the exclude file, you are better off
using a prefix tree (aka a trie*).  It is an O(m log n)** algorithm,
as opposed to the O(n*m) algorithm you are using now.  There is at
least one Perl implementation: Tree::Trie***.
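
To show the idea without depending on a CPAN module, here is a
hand-rolled sketch of a prefix tree built from nested hashes.  It is
not Tree::Trie's API; the names build_trie and has_prefix are
hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Build a trie from a list of words: one nested hash level
# per character.
sub build_trie {
    my @words = @_;
    my %root;
    for my $word (@words) {
        my $node = \%root;
        for my $char (split //, $word) {
            $node->{$char} ||= {};
            $node = $node->{$char};
        }
    }
    return \%root;
}

# True if $prefix is the start of at least one stored word.
sub has_prefix {
    my ($trie, $prefix) = @_;
    my $node = $trie;
    for my $char (split //, $prefix) {
        return 0 unless exists $node->{$char};
        $node = $node->{$char};
    }
    return 1;
}

# Store the exclude entries, then ask which complete entries
# are prefixes of something in the trie.
my $trie = build_trie(qw(10 20));
my @hits = grep { has_prefix($trie, $_) } qw(1 2 3);
print "@hits\n";
```

Each lookup walks at most one hash level per character of the entry
being checked, so it never has to scan the whole exclude list.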

If you don't have regexes in the complete file and do not want to
check for entries as prefixes in the exclude file you are better off
using a hash set***** to test for existence (roughly an O(m+n)
solution).  Luckily, a hash set is easy to build in Perl: you just use
a hash variable with the keys being your data and the values all being
either undef or 1 depending on your style (I tend to use 1 for
simplicity's sake, but I think undef might be smaller).  Using a hash
also gives you the freedom to use something like DB_File****** if the
files get very large (thus saving memory without having to add much
code).
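
The two styles look like this in practice.  This is only a sketch
with made-up sample data; the point is that the undef style must be
tested with exists, while the 1 style can be tested for truth.

```perl
use strict;
use warnings;

my @exclude = (1, 2, 3);   # hypothetical sample data

# Style 1: every value is 1, so a plain truth test works.
my %seen_one = map { $_ => 1 } @exclude;

# Style 2: a hash slice assigns undef to every key; membership
# must then be tested with exists, not truth.
my %seen_undef;
@seen_undef{@exclude} = ();

my @kept = grep { !exists $seen_undef{$_} } (1, 2, 10, 20);
print "@kept\n";
```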

snip
>         print OUT $1;
>        }
>    }
>  }
>  close(ALL);
>  close(EX);
>  close(OUT);
snip

These calls to close at the end of the script are unnecessary.  Only
call close explicitly if you need to close a file before the
filehandle goes out of scope.
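
A small sketch of what "goes out of scope" means for a lexical
filehandle.  The temporary file from File::Temp (a core module) is
only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file standing in for a real output file; it is
# removed automatically when the script exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

{
    open my $out, ">", $path
        or die "could not open $path: $!";
    print {$out} "1\n2\n";
}   # $out goes out of scope here, so Perl flushes and closes it

# The data is already on disk, with no explicit close on $out.
open my $in, "<", $path
    or die "could not open $path: $!";
my $lines = 0;
$lines++ while <$in>;
print "read back $lines lines\n";
```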

Another simple tip is to treat STDIN/files on the command line as your
complete file and STDOUT as your output file.  This form of Perl
script is called a filter and is very easy to write and use.  What
follows is my implementation of the hash set version:

#!/usr/bin/perl

use strict;
use warnings;

#this is a hack to make the script runnable
#without external data files, in a normal
#script you would open a real exclude file
#here
my $exclude = "1\n2\n3\n";
open my $ex, "<", \$exclude
    or die "could not open the scalar \$exclude as a file: $!";

my %exists;
$exists{$_} = 1 while <$ex>;

#this is also a hack, in a normal script
#you would say
#while (my $line = <>) {
#to get a loop over STDIN or files specified
#on the commandline
while (my $line = <DATA>) {
        print $line unless $exists{$line};
}

__DATA__
1
2
10
20
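
For comparison, here is a sketch of the same filter without the two
hacks.  It assumes the exclude file is given as the first command-line
argument (a convention chosen for this example, as is the script name
filter.pl), and it chomps both sides so that a missing newline on the
last line of either file does not cause a mismatch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reads the exclude list from the handle $ex and returns a
# reference to a hash set keyed on the chomped lines.
sub load_excludes {
    my ($ex) = @_;
    my %exists;
    while (my $line = <$ex>) {
        chomp $line;
        $exists{$line} = 1;
    }
    return \%exists;
}

# Hypothetical usage: ./filter.pl exclude.txt all.txt > kept.txt
# Does nothing when called without arguments.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $ex, "<", $file
        or die "could not open $file: $!";
    my $exists = load_excludes($ex);

    # <> reads the remaining files on the command line, or STDIN
    # if there are none.
    while (my $line = <>) {
        chomp $line;
        print "$line\n" unless $exists->{$line};
    }
}
```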


* http://en.wikipedia.org/wiki/Trie
** This is big O notation****; basically it measures the order of
magnitude of the number of steps needed to complete the algorithm.
So, if you had 1,000 lines in exclude and 10,000 lines in complete it
would take roughly 10,000,000 steps to complete the algorithm you are
using now and only 13,287 with the trie.
*** http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm
**** http://en.wikipedia.org/wiki/Big_O_notation
***** basically a hash whose values are ignored, used only to test whether a key exists
****** http://perldoc.perl.org/DB_File.html

-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

