jimsgib...@gmail.com (Jim Gibson) writes: > You have a logic error in your code somewhere. Your logic for counting > the files with a name matching a certain pattern is too complicated to > follow and point out where your error lies and how to fix it. The > basic problem is that your logic depends upon detecting when the > directory name changes, and will therefore not work on the last file > encountered. It also depends upon all of the files in a directory > being scanned sequentially. While that seems to work, it is not > guaranteed by File::Find. > > A simpler method for counting files in each directory would be to use > a hash with the directory name as key and the number of files as > value. Here is a program for doing that:
Wow. That really is simpler. Nice to see such succinct code. I guess all those same directory names getting crammed into $counts just get swallowed by that hash characteristic... no need to seek out the uniq name with $counts{$dir}++ = 0 like I was doing. > Note that I recommend keeping the raw numerical data (file counts) and > not saving the output lines. Generate the output lines when you print > them. I saw that, and liked that too. One thing I'm pretty curious about is that when I change your code enough to run it against the real archive .. which was just to change the directory name being fed and change the printf a bit to get the count of returned directories at print out time and the length of the directory names (even shortened) accounted for. [...] >From yours: ------- ------- ---=--- ------- ------- my $startdir = './dir1'; my %counts; find sub { return unless -f; return unless /^\d+$/; $counts{$File::Find::dir}++; }, $startdir; my( $total_dir, $total_files); print "# Files Directory\n------- ---------------\n"; for my $dir ( sort keys %counts ) { (my $shortdir = $dir) =~ s{.*News/agent/nntp}{}; my $nfiles = $counts{$dir}; printf "%5d %s\n", $nfiles, $shortdir; $total_dir++; $total_files += $nfiles; } print "\n<$total_files> files in <$total_dir> directories\n”; ------- ------- ---=--- ------- ------- To this: Yours with slight alterations. ------- ------- ---=--- ------- ------- my $startdir = '/home/gnusu/News/agent/nntp'; my %counts; find sub { return unless -f; return unless /^\d+$/; $counts{$File::Find::dir}++; }, $startdir; my( $total_dir, $total_files); my $gcnt = 0; for my $dir ( sort keys %counts ) { $gcnt++; (my $shortdir = $dir) =~ s{.*News/agent/nntp}{}; my $nfiles = $counts{$dir}; printf "%2d: %-56s %6d\n", $gcnt, $shortdir, $nfiles; $total_dir++; $total_files += $nfiles; } print "\n<$total_files> posts in <$total_dir> directories\n"; ------- ------- ---=--- ------- ------- The outputs of my original code and yours with the alterations are the same: Printing out 44 directories, each with its file count and ending with the totals. <4261964> posts in <44> directories overall What surprised me is the that when I ran them prefaced with the `time' utility, I see the sloppy mess I wrote is nearly twice as fast. I named my own code `cntNews' so named yours `hash_cntNews' time cntNews (the code from OP): [...] showing last 5 lines of output 42: nntp.perl.org/perl/beginners 126568 43: nntp.perl.org/perl/beginners/cgi 13609 44: nntp.perl.org/perl/perl6/users 3967 <4261964> posts in <44> directories overall real 0m15.976s user 0m10.948s sys 0m3.416s ------- ------- ---=--- ------- ------- time hash_cntNews (your code with slight alterations): [...] showing last 5 lines of output 42: /nntp.perl.org/perl/beginners 126568 43: /nntp.perl.org/perl/beginners/cgi 13609 44: /nntp.perl.org/perl/perl6/users 3967 <4261964> posts in <44> directories overall real 0m29.886s user 0m12.892s sys 0m12.824s Even though I'd rather use your code since it is more reliable, because like you say; File::Find does not guarantee mine will always work: But, counting over 4.2 million msgs, can you guess why that time factor might be the way it is? One is nearly twice as fast. I did try them at least a dozen times so far... and the timings come out either the same or nearly so each time. The biggest difference I saw was +- 2 seconds on each code's time. But still remained roughly twice as fast. -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/