On 09/24/13 00:12, Dr.Ruud wrote:
> I assume this is about paths and filenames. Have you considered an rsync dry-run? I use "rsync -n ..." frequently.
>
> I also assume that you want to communicate as little as possible, so you don't have supersets of all strings on all sides (or it would become a simple indexing problem). I also assume that you are more interested in missing items, so hash-value collisions are not a problem.
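For reference, a dry run like the following (paths invented) lists what rsync would transfer without changing anything, which already reports what is missing on the other side:

    # -n dry run, -a archive mode, -i itemize each difference
    rsync -nai /usr/local/ user@remotehost:/usr/local/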
My use-case is ~100k files. I'm looking for a hash function that will have few, if any, collisions.
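For scale: with a 128-bit digest such as MD5, the birthday bound puts the chance of any accidental collision among n = 100,000 inputs at roughly n^2 / 2^129, i.e. 10^10 / 6.8 x 10^38, about 1.5 x 10^-29. Accidental collisions are a non-issue at this size; they only matter if someone deliberately crafts colliding inputs.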
> I also assume that the set of string1 is smaller than that of string2, let's say 100 vs. 10000 different values.
string1 and string2 can be anywhere from the empty string to the entire contents of files; the largest file I have is ~12 GB.
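A ~12 GB file obviously can't be slurped into a string first. A minimal Perl sketch (the path is made up) that streams a file through Digest::MD5, so memory use stays constant regardless of file size:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5;

    my $path = '/data/big.iso';          # hypothetical large file
    open my $fh, '<:raw', $path or die "open $path: $!";

    my $md5 = Digest::MD5->new;
    $md5->addfile($fh);                  # reads the handle in chunks
    print $md5->hexdigest, "  $path\n";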
> For local deduplication, you would store paths as a directory name and a parent-index:
>
>     #table=path
>     #columns=id,name,pid
>     1,"",0
>     2,"usr",1
>     3,"local",2
>
> And then have a list of filenames, and per filename in which path it exists:
>
>     #table=file
>     #columns=id,name
>
>     #table=detail
>     #columns=file_id,path_id,size,md5
>
> For combining index values, use something like: ( i1 << N ) | i2 (where N is the number of bits needed by i2).
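That shift-and-or is plain bit-packing of two integer ids into one key. A minimal Perl sketch with made-up values:

    use strict;
    use warnings;

    my ($i1, $i2) = (7, 3);                         # e.g. file_id and path_id
    my $max_i2    = 10_000;                         # largest expected i2
    my $n         = length sprintf '%b', $max_i2;   # bits needed by i2: 14

    my $key = ($i1 << $n) | $i2;                    # one combined integer
    printf "key=%d -> i1=%d, i2=%d\n",
        $key, $key >> $n, $key & ((1 << $n) - 1);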
Where did you find "( i1 << N ) | i2" for MD5?
> I would not involve string concatenation: keep things separate once separated. Use arrays.
I would prefer comparing two files by comparing two digests, rather than comparing two arrays of digests.
> Use (parts of) MD5s of strings, if you need to compare to remote locations.
I use all of the digest.
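If each file yields an array of per-chunk digests, one way (my illustration, not anything proposed above) to still compare two files with a single equality test is to digest the digests:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Hypothetical per-chunk digests for two files.
    my @chunks_a = map { md5("chunk-$_") } 1 .. 4;
    my @chunks_b = map { md5("chunk-$_") } 1 .. 4;

    # Reduce each array to one digest-of-digests; no string concatenation.
    my $sig = sub {
        my $md5 = Digest::MD5->new;
        $md5->add($_) for @_;
        return $md5->hexdigest;
    };

    print $sig->(@chunks_a) eq $sig->(@chunks_b) ? "same\n" : "differ\n";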
> So it would be best to first explain *more* about what you are trying to solve. A single computer or multiple computers, connected or not? Suppose one computer sends a concise email about what it has, such that the other computer can reply with an even more concise email about what it has and what it needs. IOW: diff+patch.
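A minimal sketch of that exchange (the manifest format and entries are invented): side A sends one "<md5>  <path>" line per file, and side B replies with only the lines it lacks:

    use strict;
    use warnings;

    # What the remote side says it has (hypothetical manifest).
    my @theirs = (
        "9e107d9d372bb6826bd81d3542a419d6  /usr/local/a.txt",
        "e4d909c290d0fb1ca068ffaddf22cbd0  /usr/local/b.txt",
    );

    # What we have locally.
    my %have = map { $_ => 1 } (
        "9e107d9d372bb6826bd81d3542a419d6  /usr/local/a.txt",
    );

    # The "even more concise" reply: only what is missing here.
    print "need: $_\n" for grep { !$have{$_} } @theirs;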
I'd like the application(s) to work over SSH, similar to rsync.

David