jimsgib...@gmail.com (Jim Gibson) writes:

> You have a logic error in your code somewhere. Your logic for counting
> the files with a name matching a certain pattern is too complicated to
> follow and point out where your error lies and how to fix it. The
> basic problem is that your logic depends upon detecting when the
> directory name changes, and will therefore not work on the last file
> encountered. It also depends upon all of the files in a directory
> being scanned sequentially. While that seems to work, it is not
> guaranteed by File::Find.
>
> A simpler method for counting files in each directory would be to use
> a hash with the directory name as key and the number of files as
> value. Here is a program for doing that:

Wow.  That really is simpler.  Nice to see such succinct code.

I guess all those same directory names getting crammed into $counts
just get swallowed by that hash characteristic... no need to seek out
the uniq name with $counts{$dir}++ = 0 like I was doing. 

> Note that I recommend keeping the raw numerical data (file counts) and
> not saving the output lines. Generate the output lines when you print
> them.

I saw that, and liked that too.

One thing I'm pretty curious about is that when I change your code
enough to run it against the real archive .. which was just to change
the directory name being fed and change the printf a bit to get the
count of returned directories at print out time and the length of the
directory names (even shortened) accounted for.

[...]

>From yours:
-------       -------       ---=---       -------       ------- 
 my $startdir = './dir1';
 my %counts;

 find sub {
        return unless -f;
        return unless /^\d+$/;
        $counts{$File::Find::dir}++;
 }, $startdir;

 my( $total_dir, $total_files);
 print "# Files   Directory\n-------   ---------------\n";
 for my $dir ( sort keys %counts ) {
        (my $shortdir = $dir) =~ s{.*News/agent/nntp}{};
        my $nfiles = $counts{$dir};
        printf "%5d     %s\n", $nfiles, $shortdir;
        $total_dir++;
        $total_files += $nfiles;
 }
 print "\n<$total_files> files in <$total_dir> directories\n”;
 
-------       -------       ---=---       -------       ------- 
To this:
 Yours with slight alterations.
-------       -------       ---=---       -------       ------- 
my $startdir = '/home/gnusu/News/agent/nntp';
my %counts;

find sub {
        return unless -f;
        return unless /^\d+$/;
        $counts{$File::Find::dir}++;
}, $startdir;

my( $total_dir, $total_files);
my $gcnt = 0;

for my $dir ( sort keys %counts ) {
        $gcnt++;
        (my $shortdir = $dir) =~ s{.*News/agent/nntp}{};
        my $nfiles = $counts{$dir};
        printf "%2d: %-56s %6d\n", $gcnt, $shortdir, $nfiles;
        $total_dir++;
        $total_files += $nfiles;
}
print "\n<$total_files> posts in <$total_dir> directories\n";
-------       -------       ---=---       -------       ------- 

The outputs of my original code and yours with the alterations are the
same:

Printing out 44 directories, each with its file count and ending with
the totals.

<4261964> posts in <44> directories overall

What surprised me is the that when I ran them prefaced with the `time'
utility, I see the sloppy mess I wrote is nearly twice as fast.

I named my own code `cntNews' so named yours `hash_cntNews'

time cntNews (the code from OP):

 [...] showing last 5 lines of output
 
42: nntp.perl.org/perl/beginners                            126568
43: nntp.perl.org/perl/beginners/cgi                         13609
44: nntp.perl.org/perl/perl6/users                            3967

<4261964> posts in <44> directories overall

real    0m15.976s
user    0m10.948s
sys     0m3.416s


-------       -------       ---=---       -------       -------

time hash_cntNews (your code with slight alterations):

 [...] showing last 5 lines of output
 
42: /nntp.perl.org/perl/beginners                            126568
43: /nntp.perl.org/perl/beginners/cgi                         13609
44: /nntp.perl.org/perl/perl6/users                            3967
 
<4261964> posts in <44> directories overall

real    0m29.886s
user    0m12.892s
sys     0m12.824s


Even though I'd rather use your code since it is more reliable,
because like you say; File::Find does not guarantee mine will always
work: But, counting over 4.2 million msgs, can you guess why that time
factor might be the way it is? One is nearly twice as fast.

I did try them at least a dozen times so far... and the timings
come out either the same or nearly so each time.  The biggest difference I
saw was +- 2 seconds on each code's time. But still remained roughly
twice as fast.

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/


Reply via email to