Also... you can use the Digest::MD5 module to create an MD5 signature for comparing files that have the same size.
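Something along these lines, for instance (untested, and md5_of_file is just an illustrative name):

use strict;
use warnings;
use Digest::MD5;

# Compute an MD5 signature for one file, reading in binary
# mode so .doc/.xls/.ppt files hash correctly on Windows.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $md5;
}

# Two same-size files are almost certainly duplicates if
# their signatures match:
# print "duplicates\n" if md5_of_file($file1) eq md5_of_file($file2);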
Teddy

----- Original Message -----
From: "Xavier Noria" <[EMAIL PROTECTED]>
To: "beginners perl" <beginners@perl.org>
Sent: Saturday, July 23, 2005 10:46 AM
Subject: Re: File Management

> On Jul 23, 2005, at 7:56, Joel Divekar wrote:
>
> > We have a windoz based file server with thousands of
> > user accounts. Each user has thousands of files in his
> > home directory. Most of these files are duplicates,
> > modified, or updated versions of existing files. These
> > files are either .doc or .xls or .ppt files which are
> > shared by groups or departments.
> >
> > Due to this my server is holding terabytes of data,
> > most of which is redundant, and our sysadmin has a
> > tough time maintaining storage space.
> >
> > For this I thought of writing a small program to locate
> > similar or duplicate files stored on my file server and
> > delete them with the help of the user. The program
> > should work very fast and I don't know where to start.
>
> Well, to come up with the right solution one would need to
> play around a bit on the server. I propose an approach based
> on the description above, just in case it helps.
>
> Since there is a big number of files, we need to walk the
> tree at least once and store some data for each file to
> compare. I would choose a quick test first that speeds up
> the tree traversal as much as possible and purges the tree,
> and then do heavier operations on the remaining candidates.
>
> For instance:
>
> 1. Walk the tree and build a map using -s
>
>        size -> filenames
>
> 2. Purge the entries that have just one filename associated,
> since they have no duplicate for sure
>
> 3. Work on the rest of the entries.
>
> If the map in (1) gets too big to fit in a hash in memory
> you could use some sort of database table, maybe something
> as simple to set up as SQLite. For (3), if the number of
> candidates is still not small you could make an additional
> refinement, constructing a map with MD5s, until you get a
> small number of files and can compare their contents.
>
> Trace the tree traversal as little as possible; printing a
> debug line to the console for each file, for instance, would
> slow down the script by orders of magnitude.
>
> Then, to maintain that tree, I don't know, maybe the time to
> do this is acceptable? Running that procedure periodically
> might be a simple but good enough solution.
>
> -- fxn
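Putting the two ideas together, here is a rough, untested sketch of the size-first, MD5-second approach fxn describes. The directory comes from the command line, and it only reports candidate duplicates rather than deleting anything; leave the deleting to the user, as Joel wanted.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my $root = shift @ARGV or die "Usage: $0 <directory>\n";

# 1. Walk the tree and group filenames by size (-s).
my %by_size;
find(sub {
    return unless -f $_;
    push @{ $by_size{ -s _ } }, $File::Find::name;
}, $root);

# 2. Skip sizes with a single file: they have no duplicate
#    for sure.
# 3. Refine the remaining candidates with MD5 signatures.
my %by_md5;
for my $files (values %by_size) {
    next if @$files < 2;
    for my $file (@$files) {
        open my $fh, '<', $file or next;
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        push @{ $by_md5{$md5} }, $file;
    }
}

# Report groups of probable duplicates for the user to review.
for my $md5 (sort keys %by_md5) {
    my $files = $by_md5{$md5};
    next unless @$files > 1;
    print "Possible duplicates ($md5):\n";
    print "  $_\n" for @$files;
}

As fxn notes, if %by_size gets too big to hold in memory you would swap the hashes for something like SQLite, and keep per-file tracing out of the traversal.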