Re: Comparing files with regular expressions

Aaron Rubinstein Mon, 05 May 2008 07:24:11 -0700

> Given just the idea of the data, can you improve on that?

I bet I could!  It's interesting how my instinct, when trying to develop a
programming solution, is to wrestle with the problem inside the context of
the language.  As a result, the solutions I come up with tend to be shaped
by my limited understanding of that language.  I think you're right that
this is a case of fluency, that I am fluent in English and my best problem
solving skills are most likely in that context.  Trying to solve the problem
in Perl, I'm likely not using my best skills and thus come up with a poor
solution.


I also take from your advice, whether you meant it or not, that I should
approach my code as if it would be scalable.  My solution is probably
adequate for a small scale problem but its silliness would quickly be
exposed as soon as the data scaled up.
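
A minimal sketch of a version that should scale better, assuming the lines
have already been read and chomped. The helper name exclude_items and the
sample numbers are made up for illustration. The nested loops make roughly
three million comparisons for 9000 x 333 lines; a hash lookup built once from
the exclude list replaces the inner loop with a single probe per line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the items in @$all that are not in @$exclude.
sub exclude_items {
    my ($all, $exclude) = @_;

    # Build the lookup table once; each later check is a single hash probe.
    my %skip = map { $_ => 1 } @$exclude;

    return grep { !$skip{$_} } @$all;
}

# Made-up sample data standing in for the chomped lines of all.txt
# and exclude.txt.
my @all     = qw(101 102 103 104 105);
my @exclude = qw(102 104);

my @kept = exclude_items(\@all, \@exclude);
print "$_\n" for @kept;
```

With the sample data this prints 101, 103 and 105, one per line.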

Thanks for the advice and inspiration.

On Sat, May 3, 2008 at 8:08 PM, Rob Dixon <[EMAIL PROTECTED]> wrote:

> rubinsta wrote:
> > Hello,
> >
> > I'm a Perl uber-novice and I'm trying to compare two files in order to
> > exclude items listed on one file from the complete list on the other
> > file.  What I have so far prints out a third file listing everything
> > that matches the exclude file from the complete file (which I'm hoping
> > will be a duplicate of the exclude file) just so I can make sure that
> > the comparison script is working.  The files are lists of numbers
> > separated by newlines.  The exclude file has 333 numbers and the
> > complete file has 9000 numbers.
> >
> > Here's what I have so far:
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> >
> > open(ALL, "all.txt") or die $!;
> > open(EX, "exclude.txt") or die $!;
> > open(OUT,'>exTest.txt') or die $!;
> >
> > my @ex_lines = <EX>;
> > my @all_lines = <ALL>;
> >
> > foreach my $all (@all_lines){
> >    foreach my $ex (@ex_lines){
> >        if ($ex =~ /(^$all)/){
>
> The lines you have read from the input files are unchomped (they include
> the trailing newline character) and there is no allowance for leading or
> trailing whitespace. Are you sure of your input data?
>
> The regex has an unnecessary capture (parentheses) and isn't anchored at
> the end of the string, although leaving the record separator at the end of
> $ex and $all has a similar effect.
>
> It should really be simply
>
>  if ($ex eq $all)
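
A small sketch of the difference between the start-anchored pattern and eq
once the lines have been cleaned up. The helper clean() and the sample values
are hypothetical, and trimming whitespace is an assumption about what the
input needs.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the newline plus leading and trailing whitespace.
sub clean {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^\s+|\s+$//g;
    return $line;
}

my $all = clean("12\n");
my $ex  = clean(" 123 \n");

# Anchored only at the start, the pattern accepts any line beginning with "12".
print "regex matches\n" if $ex =~ /^$all/;

# String equality compares the whole value, so "123" and "12" differ.
print "eq matches\n" if $ex eq $all;
```

Only "regex matches" is printed: with the newline gone, nothing stops /^12/
from matching "123", whereas eq compares the complete strings.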
>
> >       print OUT $1;
>
> The two strings are equal, so
>
>  print OUT $all;
>
> >        }
> >    }
> > }
> > close(ALL);
> > close(EX);
>
> Explicit close calls are pointless unless the status is verified. All open
> filehandles will be closed by Perl when it finishes processing the script.
>
> (Even if an input file doesn't close cleanly, the damage has already been
> done when an earlier read failed. If a volume is dismounted while the
> program is running, for example, without explicit handling of read errors
> the file will simply appear to be shorter than its true length.)
>
> > close(OUT);
>
> There's no need to close output files unless you're in a fragile
> environment, or if it is vital that the output information is complete.
> For instance it may be useful to write
>
>  close $output or die $!;
>  unlink 'input.txt';
>
> so that the input data is discarded only once the output data has been
> safely written and secured.
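
A self-contained sketch of that checked close followed by unlink. The file
names are hypothetical and everything happens in a temporary directory
(File::Temp is a core module), so nothing real is touched.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a throwaway directory that is removed when the script exits.
my $dir    = tempdir(CLEANUP => 1);
my $input  = "$dir/input.txt";     # hypothetical file names
my $output = "$dir/output.txt";

# Create a small input file to stand in for the real data.
open(my $seed, '>', $input) or die "Cannot create $input: $!";
print {$seed} "101\n102\n";
close($seed) or die "Cannot close $input: $!";

open(my $in,  '<', $input)  or die "Cannot read $input: $!";
open(my $out, '>', $output) or die "Cannot write $output: $!";
while (my $line = <$in>) {
    print {$out} $line;
}
close($in);

# Buffered output is flushed here, so a write failure surfaces as a
# failed close.
close($out) or die "Cannot close $output: $!";

# Reached only if the close above succeeded.
unlink($input) or die "Cannot remove $input: $!";
```

Because die stops the script, the unlink is never reached when the close
fails, and the input file survives for another attempt.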
>
> > I realize the nested foreach loops are ugly but I don't know enough to
> > navigate the filehandles, which as I understand, can only be assigned
> > to variables in their entirety as an array.  Any thoughts on how this
> > might be done?
>
> You should try to solve the problem instead of solving the data. Nearly
> all of your code is about opening, reading, and closing files. Your
> solution amounts to:
>
>  if any of the lines in ALL match any of the lines in EX then print (it)
>
> Given just the idea of the data, can you improve on that? For instance, if
> one or both of the object files are sorted then you may not need to
> reassess all of the lines for each comparison. Or if the lines could occur
> more than once in either or both files, then it may be an idea to maintain
> a record of what comparisons had already been made. Those ideas are
> independent of Perl, or indeed of any programming language.
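
One way the sorted-files idea could look in Perl, as a sketch only. It
assumes both lists are already in ascending numeric order and contain no
repeated values; the helper name common_sorted and the sample numbers are
invented. Each list is walked once, so no line is reassessed.

```perl
use strict;
use warnings;

# Hypothetical helper: both lists must already be sorted in ascending
# numeric order, with no repeated values.
sub common_sorted {
    my ($all, $ex) = @_;
    my ($i, $j) = (0, 0);
    my @common;

    # Advance whichever side holds the smaller value; record equal values.
    while ($i < @$all && $j < @$ex) {
        if    ($all->[$i] < $ex->[$j]) { $i++ }
        elsif ($all->[$i] > $ex->[$j]) { $j++ }
        else {
            push @common, $all->[$i];
            $i++;
            $j++;
        }
    }
    return @common;
}

my @all = (3, 8, 15, 21, 42);
my @ex  = (8, 21, 30);
print join(',', common_sorted(\@all, \@ex)), "\n";
```

With the sample data this prints 8,21.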
>
> After that, the line blurs. Programming languages are useful thinking
> tools for imagining programming solutions, just as natural languages are
> useful for life's challenges. An idea expressed in Latin can be impossible
> to recreate intact in French, just as a solution in Forth can be
> inexpressible in C++.
>
> But despite its blurriness the line is narrow, so have courage and dash
> across it into the implementation, where all languages have ways to open,
> close, read and write files; ways to handle numbers and strings;
> conveniences for arrays and constants and, God forbid, error handling.
>
> But I encourage you to start at the beginning, and if common sense is more
> familiar to you than Perl or any other programming language then use that.
> Your imagination is your best tool.
>
> If you were given two piles of line printer paper and were told to find
> the differences:
>
> - what questions would you ask about the problem?
> - how would you go about it?
> - what would you want to know about the contents?
>
> Once you know the answers, you have a solution. Then you can code it,
> given knowledge of the language at hand.
>
> Many things will change the solution, just as you would do things
> differently if you had only two sheets of paper to compare, or a
> two-inch-thick stack. Whether you had to do it every day or it was
> somebody else's turn in ten years' time. Whether it was obvious that all
> of the lines on one stack of paper were the same except for a few changes.
> You get the idea?
>
> But unless it is easier for you to formulate solutions in Perl or any
> other language, imagine a real-world equivalent and use common sense.
>
> Then just code it, and we will help.
>
> HTH,
>
> Rob
>
