Uri Guttman wrote:
>>>>>> "SB" == Steve Bertrand <st...@ibctech.ca> writes:
>
>  SB> Perhaps there is a Perl way to do it, but otherwise, for 250GB of data,
>  SB> research dump/restore, and test it out (after making a backup).
>
>  SB> imho, you shouldn't use another layer of abstraction for managing such a
>  SB> large volume of data, unless you are attempting to create some sort of
>  SB> index for it.
>
> i want to back up steve here. no way perl will ever handle that much
> data in anything like the time a dedicated dump/rsync/etc could
> do. those are optimized and written in c just for that job. perl would
> be massively slower. it should be easy enough to just benchmark perl's
> File::Copy vs unix cp on a large file. multiply that by that many files
> and you will easily see the problem here.
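(For reference, the quick single-file check Uri suggests might be sketched roughly as follows; the test file, its size, and the run count are assumptions for illustration, not details from the thread.)

#!/usr/bin/perl

# Sketch of the single-file comparison suggested above: Perl's File::Copy
# vs. the system cp on one large file. File name, size, and run count are
# assumptions made up for this example.

use warnings;
use strict;

use File::Copy qw( copy );
use Benchmark qw( :all );

my $src = './bigfile';        # e.g. created beforehand with: mkfile 100m ./bigfile
my $dst = './bigfile.copy';

my $results = timethese( 20, {
        'perl-copy' => sub { copy( $src, $dst ) or die "copy failed: $!" },
        'unix-cp'   => sub { system( 'cp', $src, $dst ) == 0 or die "cp failed: $?" },
    } );

# cp runs in a child process, so its work shows up in the child CPU columns;
# compare the wallclock figures rather than the CPU-based percentages.
cmpthese $results;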
After I received Uri's post to the list, I briefly removed myself from what I was doing and wrote the following code (s/ode/rap). I wanted to test it for myself. I don't remember the last time I used backticks, but I did here, just to see what would happen. The results of the benchmark follow the __END__.

#!/usr/bin/perl

use warnings;
use strict;

use File::Copy::Recursive qw( dircopy );
use Benchmark qw( :all );

my $directory = './files';
my $backup    = './backup';

mkdir $backup if ! -e $backup;

generate_files() if ! -e $directory;

my $results = timethese( 10, {
        'rsync'   => sub { `rsync -arc $directory $backup` },
        'perl-cp' => sub { dircopy( $directory, $backup ) },
    } );

cmpthese $results;

sub generate_files {

    mkdir $directory if ! -e $directory;

    my $file_size = '1m';

    for my $ext ( 1..1000 ) {
        my $file_to_create = "file.${ext}";
        `mkfile $file_size $directory/$file_to_create`;
    }
}

__END__

amanda# ./bench.pl

Benchmark: timing 10 iterations of perl-cp, rsync...
   perl-cp: 418 wallclock secs ( 3.15 usr 34.51 sys +  0.00 cusr  0.01 csys = 37.66 CPU) @  0.27/s (n=10)
     rsync: 493 wallclock secs ( 0.00 usr  0.00 sys + 67.80 cusr 19.63 csys = 87.44 CPU)

             s/iter               perl-cp                 rsync
perl-cp        3.77                    --                 -100%
rsync      1.00e-16  3765625000000000000%                    --

From what I can tell, if I'm interpreting the results correctly, it appears as though rsync does a bit better. The data was (as noted in the code) 1000 1 MB files, all located within a single directory.

I ran this test ranging from count 1 through count 20, and the results were essentially the same.

In essence, unless testing rsync within Perl is causing mixed results, don't use Perl to back up or copy large amounts of data, period.

Steve
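One caveat on reading the cmpthese output above: the rsync case runs in a child process via backticks, so almost none of its work is charged to the parent's CPU times, which is likely why rsync shows an implausible 1.00e-16 s/iter and an astronomical percentage. A rough sketch that times each alternative by elapsed wall time instead (reusing the directory names from the script above; the time_wallclock helper is invented for this example) might look like this:

#!/usr/bin/perl

# Wallclock-only variant of the same comparison: a sketch, not a drop-in
# replacement for the script above. Each alternative is timed by elapsed
# wall time, so work done in child processes still counts.

use warnings;
use strict;

use File::Copy::Recursive qw( dircopy );
use Time::HiRes qw( gettimeofday tv_interval );

my $directory = './files';
my $backup    = './backup';

sub time_wallclock {
    my ( $label, $count, $code ) = @_;
    my $t0 = [ gettimeofday ];
    $code->() for 1 .. $count;
    printf "%-8s %.2f wallclock secs for %d runs\n",
        $label, tv_interval( $t0 ), $count;
}

time_wallclock( 'rsync',   10, sub { system 'rsync', '-arc', $directory, $backup } );
time_wallclock( 'perl-cp', 10, sub { dircopy( $directory, $backup ) } );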