Also...

You can use the Digest::MD5 module to create an MD5 signature for
comparing files that have the same size.
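
For instance, a minimal sketch that hashes one group of same-size files
(the %by_md5 hash and the find_duplicates name are just illustrative):

    use strict;
    use warnings;
    use Digest::MD5;

    # Given an array ref of paths to files of equal size, return the
    # groups of paths whose contents hash to the same MD5 digest.
    sub find_duplicates {
        my ($paths) = @_;
        my %by_md5;
        for my $path (@$paths) {
            open my $fh, '<', $path or next;   # skip unreadable files
            binmode $fh;
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_md5{$md5} }, $path;
        }
        return grep { @$_ > 1 } values %by_md5;   # true duplicate groups
    }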


Teddy


----- Original Message ----- 
From: "Xavier Noria" <[EMAIL PROTECTED]>
To: "beginners perl" <beginners@perl.org>
Sent: Saturday, July 23, 2005 10:46 AM
Subject: Re: File Management


> On Jul 23, 2005, at 7:56, Joel Divekar wrote:
>
> > We have a Windows-based file server with thousands of
> > user accounts. Each user has thousands of files in his
> > home directory. Most of these files are duplicates or
> > modified/updated versions of existing files. The files
> > are .doc, .xls, or .ppt files shared by groups or
> > departments.
> >
> > Because of this, my server holds terabytes of data, most
> > of which is redundant, and our sysadmin has a tough time
> > maintaining storage space.
> >
> > So I thought of writing a small program to locate
> > similar or duplicate files stored on my file server
> > and delete them with the user's help. The program
> > should be very fast, and I don't know where to start.
>
> Well, to come up with the right solution one would need to play
> around a bit on the server. I propose an approach based on the
> description above, in case it helps.
>
> Since there is a big number of files and we need to walk the tree at
> least once, storing some data for each file to compare, I would
> choose a quick test first that keeps the tree traversal as fast as
> possible and purges the candidate list, and then do heavier
> operations on the remaining candidates.
>
> For instance:
>
>      1. Walk the tree and build a map using -s
>
>             size -> filenames
>
>      2. Purge the entries that have just one filename associated,
>         since they certainly have no duplicates
>
>      3. Work on the rest of the entries (a sketch of (1) and (2)
>         follows below).
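>
> A minimal sketch of (1) and (2), using the core File::Find module
> (%by_size and $root are just illustrative names):
>
>      use strict;
>      use warnings;
>      use File::Find;
>
>      my $root = shift @ARGV;   # top of the tree to scan
>      my %by_size;              # size -> list of paths
>
>      find(sub {
>          return unless -f $_;                        # plain files only
>          push @{ $by_size{ -s _ } }, $File::Find::name;
>      }, $root);
>
>      # (2) drop sizes seen only once: those files have no duplicates
>      delete $by_size{$_}
>          for grep { @{ $by_size{$_} } == 1 } keys %by_size;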
>
> If the map in (1) gets too big to fit in a hash in memory, you could
> use some sort of database table, maybe something as simple to set up
> as SQLite. For (3), if the number of candidates is still not small,
> you could make an additional refinement by constructing a map keyed
> by MD5 digest, until you get a small number of files and can compare
> their contents.
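>
> For example, a sketch of the size map in SQLite via DBI (this
> assumes the DBD::SQLite module is installed; the table and database
> names are made up):
>
>      use DBI;
>
>      my $dbh = DBI->connect('dbi:SQLite:dbname=files.db', '', '',
>                             { RaiseError => 1, AutoCommit => 0 });
>      $dbh->do('CREATE TABLE IF NOT EXISTS files (size INTEGER, path TEXT)');
>      my $ins = $dbh->prepare('INSERT INTO files (size, path) VALUES (?, ?)');
>      # ... during the traversal: $ins->execute(-s $path, $path) ...
>      $dbh->commit;
>
>      # sizes seen more than once mark the duplicate candidates
>      my $dup_sizes = $dbh->selectcol_arrayref(
>          'SELECT size FROM files GROUP BY size HAVING COUNT(*) > 1'
>      );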
>
> Trace the tree traversal as little as possible; printing a debug
> line to the console for each file, for instance, would slow down the
> script by orders of magnitude.
>
> Then, to keep that tree clean over time, I don't know, maybe the
> time to do this is acceptable? Running that procedure periodically
> might be a simple but good-enough solution.
>
> -- fxn
>



