Re: File Management

Xavier Noria Sat, 23 Jul 2005 00:46:39 -0700

On Jul 23, 2005, at 7:56, Joel Divekar wrote:

We have a Windows-based file server with thousands of
user accounts. Each user has thousands of files
in his home directory. Most of these files are
duplicates, or modified or updated versions, of
existing files. These files are either .doc, .xls
or .ppt files which are shared by groups or
departments.

Because of this my server holds terabytes of data, most
of which is redundant, and our sysadmin has a tough time
maintaining storage space.

So I thought of writing a small program to locate
similar or duplicate files stored on my file server
and delete them with the help of the user. The program
should work very fast and I don't know where to
start.

Well, to come up with the right solution one would need to play around a bit on the server. I propose an approach based on the description above, just in case it helps.

Since there is a big number of files, we need to walk the tree at least once and store some data for each file to compare. I would start with a quick test that keeps the tree traversal as fast as possible and discards most of the files, and only then do heavier operations on the remaining candidates.


For instance:

    1. Walk the tree and build a map using -s

           size -> filenames

    2. Purge the entries that have just one filename associated, since
       they have no duplicate for sure

    3. Work on the rest of the entries.
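Steps (1) and (2) could be sketched roughly as follows. This is only an illustration, assuming the core File::Find module is acceptable for the traversal; the name candidates_by_size and the path in the usage comment are made up for the example.

```perl
use strict;
use warnings;
use File::Find;

# Steps 1 and 2: map size -> filenames, then drop sizes seen only once.
# candidates_by_size is a hypothetical helper name, not from the thread.
sub candidates_by_size {
    my ($root) = @_;
    my %by_size;
    find({
        no_chdir => 1,                    # $_ is the full path
        wanted   => sub {
            return unless -f $_;          # regular files only
            my $size = (-s _) || 0;       # reuse the stat done by -f
            push @{ $by_size{$size} }, $_;
        },
    }, $root);
    # A size with a single filename has no duplicate for sure.
    for my $size (keys %by_size) {
        delete $by_size{$size} if @{ $by_size{$size} } < 2;
    }
    return \%by_size;
}

# Usage (hypothetical path):
#   my $candidates = candidates_by_size('/path/to/share');
```

Using the special `_` filehandle avoids a second stat per file, which matters when the tree holds this many entries.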

If the map in (1) gets too big to fit in an in-memory hash, you could use some sort of database table, maybe something simple to set up such as SQLite. For (3), if the number of candidates is still not small, you could refine further by building a second map keyed on MD5 digests, until you are left with a small number of files and can compare their contents.
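The MD5 refinement for (3) might look something like this, assuming the core Digest::MD5 module. The helper names md5_of_file and group_by_md5 are invented for the sketch, and unreadable files are simply skipped, which may or may not be what you want.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 of a file's contents, or undef if the file cannot be opened.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;                          # .doc/.xls/.ppt are binary
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Given files of the same size, return a list of array refs, one per
# digest shared by two or more files.
sub group_by_md5 {
    my (@files) = @_;
    my %by_md5;
    for my $file (@files) {
        my $digest = md5_of_file($file);
        next unless defined $digest;      # skip unreadable files
        push @{ $by_md5{$digest} }, $file;
    }
    return grep { @$_ > 1 } values %by_md5;
}
```

Files that end up in the same group are very likely identical; a final byte-by-byte check, for instance with the core File::Compare module, would confirm it before anything gets deleted.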

Trace the tree traversal as little as possible: printing a debug line to the console for each file, for instance, would slow down the script by orders of magnitude.
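If some feedback is wanted anyway, one option is to report only every N files. A minimal sketch, where make_progress and the interval are assumptions for illustration:

```perl
use strict;
use warnings;

# Build a callback that prints one line every $every calls instead of
# one line per file. make_progress is a hypothetical helper name.
sub make_progress {
    my ($every, $out) = @_;
    my $count = 0;
    return sub {
        $count++;
        print {$out} "$count files scanned\n" if $count % $every == 0;
        return $count;
    };
}

# Usage: my $tick = make_progress(10_000, \*STDERR);
#        then call $tick->() once per file during the traversal.
```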

As for keeping the tree clean afterwards, I don't know; maybe the time this takes is acceptable? Running that procedure periodically might be a simple but good enough solution.


-- fxn
