On Thu, May 1, 2008 at 4:09 PM, rubinsta <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm a Perl uber-novice and I'm trying to compare two files in order to
> exclude items listed on one file from the complete list on the other
> file. What I have so far prints out a third file listing everything
> that matches the exclude file from the complete file (which I'm hoping
> will be a duplicate of the exclude file) just so I can make sure that
> the comparison script is working. The files are lists of numbers
> separated by newlines. The exclude file has 333 numbers and the
> complete file has 9000 numbers.
>
> Here's what I have so far:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> open(ALL, "all.txt") or die $!;
> open(EX, "exclude.txt") or die $!;
> open(OUT,'>exTest.txt') or die $!;
snip

Use the three-argument version of open and lexical filehandles:

open my $ex, "<", "exclude.txt"
    or die "could not open exclude.txt: $!";

snip
>
> my @ex_lines = <EX>;
> my @all_lines = <ALL>;
snip

Using filehandles in list context is a bad idea, because it reads the
whole file into memory at once. It may work now when the files are
small, but data almost always grows. Unless you are certain that the
file will remain small you should not do this. Use a while loop
instead.

snip
>
> foreach $all (@all_lines){
>   foreach $ex (@ex_lines){
>     if ($ex =~ /(^$all)/){

This tests whether any line in the exclude file starts with what was
in the complete file. That is, if the complete file contained the
lines 1 and 2 and the exclude file contained the lines 10 and 20, then
all lines would be excluded. Is this really what you want?

Also, since you have not surrounded $all with \Q and \E (like
/^\Q$all\E/), any metacharacters in $all (like *, ., ?, etc.) will be
treated as metacharacters instead of normal characters. Unless the
lines in complete are known to be regexes this could be bad. And by
bad I mean everything from mismatches to the dreaded
"(?{system qq(rm -rf $ENV{HOME})})".

If you don't have regexes in the complete file but do want to check
for its entries as prefixes in the exclude file, you are better off
using a prefix tree (aka a trie*). It is an O(m log n)** algorithm, as
opposed to the O(n*m) algorithm you are using now. There is at least
one Perl implementation: Tree::Trie***.

If you don't have regexes in the complete file and do not want to
check for entries as prefixes in the exclude file, you are better off
using a hash set***** to test for existence (roughly an O(m+n)
solution). Luckily, in Perl a hash set is easy to build: you just use
a hash variable with the keys being your data and the values all being
either undef or 1 depending on your style (I tend to use 1 for
simplicity's sake, but I think undef might be smaller).
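
To make the difference between the three tests above concrete, here is
a minimal sketch. The sub names (starts_with_unquoted,
starts_with_quoted, in_set) are hypothetical and only for illustration;
they are not from the original script. It compares an unquoted regex
prefix test, a \Q...\E quoted prefix test, and an exact hash set lookup
against the value "10".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper names; none of these appear in the original post.

# Unquoted interpolation: metacharacters in $item act as regex syntax.
sub starts_with_unquoted {
    my ($line, $item) = @_;
    return $line =~ /^$item/ ? 1 : 0;
}

# \Q...\E quotes metacharacters, so $item is matched literally,
# but this is still only a prefix test.
sub starts_with_quoted {
    my ($line, $item) = @_;
    return $line =~ /^\Q$item\E/ ? 1 : 0;
}

# Exact membership test against a hash used as a set.
sub in_set {
    my ($set, $item) = @_;
    return exists $set->{$item} ? 1 : 0;
}

# The exclude list as a hash set: keys are the data, values are all 1.
my %exclude = map { $_ => 1 } qw(10 20);

for my $item ("1", "1.", "10") {
    printf "%-4s unquoted=%d quoted=%d exact=%d\n",
        $item,
        starts_with_unquoted("10", $item),
        starts_with_quoted("10", $item),
        in_set(\%exclude, $item);
}
```

Here "1" matches "10" as a prefix with either regex, "1." only matches
when the dot is left unquoted, and only "10" itself is found by the
hash lookup.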

Using a hash also gives you the freedom to use something like
DB_File****** if the files get very large (thus saving memory without
having to add much code).

snip
>       print OUT $1;
>     }
>   }
> }
> close(ALL);
> close(EX);
> close(OUT);
snip

These calls to close at the end of the script are unnecessary. Only
call close explicitly if you need to close a file before the
filehandle goes out of scope.

Another simple tip is to treat STDIN/files on the command line as your
complete file and STDOUT as your output file. This form of Perl script
is called a filter and is very easy to write and use.

What follows is my implementation of the hash set version:

#!/usr/bin/perl

use strict;
use warnings;

#this is a hack to make the script runnable
#without external data files, in a normal
#script you would open a real exclude file
#here
my $exclude = "1\n2\n3\n";
open my $ex, "<", \$exclude
    or die "could not open the scalar \$exclude as a file: $!";

#build the hash set: each line (newline included) becomes a key
my %exists;
$exists{$_} = 1 while <$ex>;

#this is also a hack, in a normal script
#you would say
#while (my $line = <>) {
#to get a loop over STDIN or files specified
#on the commandline
while (my $line = <DATA>) {
    print $line unless $exists{$line};
}

__DATA__
1
2
10
20

* http://en.wikipedia.org/wiki/Trie
** This is big O notation****; basically it measures the order of
   magnitude of the number of steps needed to complete the algorithm.
   So, if you had 1,000 lines in exclude and 10,000 lines in complete
   it would take roughly 10,000,000 steps to complete the algorithm
   you are using now and only 13,287 (1,000 * log2(10,000)) with the
   trie.
*** http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm
**** http://en.wikipedia.org/wiki/Big_O_notation
***** basically a hash with no values, used for testing the existence
      of values
****** http://perldoc.perl.org/DB_File.html

--
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.