* Joey Hess [2012-04-07 14:57 -0400]: > Carsten Hey wrote: > > +sub read_file_and_hashify { > > + my $file = shift; > > + > > + my ($a_ref, $h_ref) = ([], {}); > > + filemap $file, sub { > > + push @$a_ref, $_; > > + $h_ref->{$_} = 1; > > + }; > > + return ($a_ref, $h_ref); > > +} > > This could be written using filemap.
It uses filemap, maybe I used an unfortunate indentation. > > sub compare_or { > > my ($file1, $file2) = @_; > > > > @@ -105,8 +116,11 @@ sub compare_or { > > sub compare_xor { > > my ($file1, $file2) = @_; > > > > - compare_not($file1, $file2); > > - compare_not($file2, $file1); > > + my ($a1, $h1) = read_file_and_hashify $file1; > > + my ($a2, $h2) = read_file_and_hashify $file2; > > + > > + foreach (@$a1) { print $_, "\n" unless exists $h2->{$_} } > > + foreach (@$a2) { print $_, "\n" unless exists $h1->{$_} } > > This uses something like 4x the memory that it used before. It would be > easy to use only 3x[1]. With a good data structure, it should be > possible to only use 2x. Using an ordered hash would be one way. I have > no experience with Tie::IxHash, but it looks promising. This 3x memory usage implementation is not the one you suggested. I think your's should use less memory, but the difference should be small (this is just for reference, and there is no need to read this code). | sub compare_xor { | my ($file1, $file2) = @_; | | my ($lines2, $seen1, $seen2) = ([], {}, {}); | filemap $file2, | sub { | push @$lines2, $_; | $seen2->{$_} = 1; | }; | filemap $file1, | sub { | print "$_\n" unless exists $seen2->{$_}; | $seen1->{$_} = 1; | }; | foreach (@$lines2) { | print "$_\n" unless $seen1->{$_}; | } | } The above can easily be adapted to use only two times of the original memory, even without an ordered hash: | sub compare_xor { | my ($file1, $file2) = @_; | | my ($lines2, $seen2) = ([], {}); | filemap $file2, | sub { | # Add all lines in file2 to @$lines2 and to %seen2 | # with value 1. | push @$lines2, $_; | $seen2->{$_} = 1; | }; | # Note the usage of 'if' and 'if exists' and their semantic | # difference when reading the remaining code of this sub. | filemap $file1, | sub { | # Print all lines in file1 that are not in file2, | # and mark lines that are in both files by setting | # their value in %seen2 to 0. | if (exists $seen2->{$_}) { | $seen2->{$_} = 0; | } | else { | print "$_\n"; | } | }; | foreach (@$lines2) { | # Print all lines that are in file2 but not in file1. | # The value of these lines in seen2 is set to 1. | print "$_\n" if $seen2->{$_}; | } | } With an ordered hash, it should be possible to further reduce the memory usage of the above. Measuring the memory usage with GNU time results in (combine is the one currently shipped in moreutils, ./combine{2..4} are the variants with two to four times of memory usage): Command being timed: "./combine4 /tmp/unsorted_1 xor /tmp/unsorted_2" Maximum resident set size (kbytes): 90800 Command being timed: "./combine3 /tmp/unsorted_1 xor /tmp/unsorted_2" Maximum resident set size (kbytes): 73920 Command being timed: "./combine2 /tmp/unsorted_1 xor /tmp/unsorted_2" Maximum resident set size (kbytes): 51744 Command being timed: "combine /tmp/unsorted_1 xor /tmp/unsorted_2" Maximum resident set size (kbytes): 36576 I'll send an update after being able to measure the memory usage of a simple implementation in C. Regards Carsten -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org