* Joey Hess [2012-04-07 14:57 -0400]:
> Carsten Hey wrote:
> > +sub read_file_and_hashify {
> > +   my $file = shift;
> > +
> > +   my ($a_ref, $h_ref) = ([], {});
> > +   filemap $file, sub {
> > +           push @$a_ref, $_;
> > +           $h_ref->{$_} = 1;
> > +   };
> > +   return ($a_ref, $h_ref);
> > +}
>
> This could be written using filemap.

It uses filemap, maybe I used an unfortunate indentation.


> >  sub compare_or {
> >     my ($file1, $file2) = @_;
> >
> > @@ -105,8 +116,11 @@ sub compare_or {
> >  sub compare_xor {
> >     my ($file1, $file2) = @_;
> >     
> > -   compare_not($file1, $file2);
> > -   compare_not($file2, $file1);
> > +   my ($a1, $h1) = read_file_and_hashify $file1;
> > +   my ($a2, $h2) = read_file_and_hashify $file2;
> > +
> > +   foreach (@$a1) { print $_, "\n" unless exists $h2->{$_} }
> > +   foreach (@$a2) { print $_, "\n" unless exists $h1->{$_} }
>
> This uses something like 4x the memory that it used before. It would be
> easy to use only 3x[1]. With a good data structure, it should be
> possible to only use 2x. Using an ordered hash would be one way. I have
> no experience with Tie::IxHash, but it looks promising.


This 3x memory usage implementation is not the one you suggested.
I think your's should use less memory, but the difference should be
small (this is just for reference, and there is no need to read this
code).

| sub compare_xor {
|     my ($file1, $file2) = @_;
|
|     my ($lines2, $seen1, $seen2) = ([], {}, {});
|     filemap $file2,
|         sub {
|             push @$lines2, $_;
|             $seen2->{$_} = 1;
|         };
|     filemap $file1,
|         sub {
|             print "$_\n" unless exists $seen2->{$_};
|             $seen1->{$_} = 1;
|         };
|     foreach (@$lines2) {
|         print "$_\n" unless $seen1->{$_};
|     }
| }


The above can easily be adapted to use only two times of the original
memory, even without an ordered hash:

| sub compare_xor {
|     my ($file1, $file2) = @_;
|
|     my ($lines2, $seen2) = ([], {});
|     filemap $file2,
|         sub {
|             # Add all lines in file2 to @$lines2 and to %seen2
|             # with value 1.
|             push @$lines2, $_;
|             $seen2->{$_} = 1;
|         };
|     # Note the usage of 'if' and 'if exists' and their semantic
|     # difference when reading the remaining code of this sub.
|     filemap $file1,
|         sub {
|             # Print all lines in file1 that are not in file2,
|             # and mark lines that are in both files by setting
|             # their value in %seen2 to 0.
|             if (exists $seen2->{$_}) {
|                 $seen2->{$_} = 0;
|             }
|             else {
|                 print "$_\n";
|             }
|         };
|     foreach (@$lines2) {
|         # Print all lines that are in file2 but not in file1.
|         # The value of these lines in seen2 is set to 1.
|         print "$_\n" if $seen2->{$_};
|     }
| }


With an ordered hash, it should be possible to further reduce the memory
usage of the above.


Measuring the memory usage with GNU time results in (combine is the one
currently shipped in moreutils, ./combine{2..4} are the variants with
two to four times of memory usage):

  Command being timed: "./combine4 /tmp/unsorted_1 xor /tmp/unsorted_2"
  Maximum resident set size (kbytes): 90800

  Command being timed: "./combine3 /tmp/unsorted_1 xor /tmp/unsorted_2"
  Maximum resident set size (kbytes): 73920

  Command being timed: "./combine2 /tmp/unsorted_1 xor /tmp/unsorted_2"
  Maximum resident set size (kbytes): 51744

  Command being timed: "combine /tmp/unsorted_1 xor /tmp/unsorted_2"
  Maximum resident set size (kbytes): 36576


I'll send an update after being able to measure the memory usage of
a simple implementation in C.


Regards
Carsten



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Reply via email to