On 09/24/13 00:12, Dr.Ruud wrote:
> I assume this is about paths and filenames. Have you considered an
> rsync dry-run?

I use "rsync -n ..." frequently.


> I also assume that you want to communicate as little as possible, so
> you don't have supersets of all strings on all sides (or it would
> become a simple indexing problem).
> I also assume that you are more interested in missing items, so
> hash-value collisions are not a problem.

My use-case is ~100k files. I'm looking for a hash function that will have few, if any, collisions.
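
For what it's worth, at ~100k files any cryptographic digest should be collision-free in practice: the birthday bound for a 128-bit hash at n = 100,000 is about n^2 / 2^129, i.e. roughly 10^-29. A minimal sketch (mine, not from the thread) using the core Digest::SHA module to walk a tree and digest every file:

  use strict;
  use warnings;
  use File::Find;
  use Digest::SHA;

  my %digest_of;    # path => hex digest

  find(sub {
      return unless -f $_;
      my $sha = Digest::SHA->new(256);
      $sha->addfile($_);                     # streams; never slurps
      $digest_of{$File::Find::name} = $sha->hexdigest;
  }, shift(@ARGV) // '.');

  printf "%s  %s\n", $digest_of{$_}, $_ for sort keys %digest_of;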


> I also assume that the set of string1 is smaller than that of string2,
> let's say 100 vs. 10000 different values.

string1 and string2 can be anywhere from the empty string to the entire contents of files; the largest file I have is ~12 GB.
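
Since the strings can be whole files up to ~12 GB, whatever computes the digests should stream rather than slurp. A sketch of a chunked helper (the name file_md5 is mine; Digest::MD5's own addfile($fh) does the same job):

  use strict;
  use warnings;
  use Digest::MD5;

  # Hypothetical helper: digest a file of any size in 1 MB chunks,
  # so even a 12 GB file needs only constant memory.
  sub file_md5 {
      my ($path) = @_;
      open my $fh, '<:raw', $path or die "open $path: $!";
      my $md5 = Digest::MD5->new;
      while (read $fh, my $buf, 1 << 20) {
          $md5->add($buf);
      }
      close $fh;
      return $md5->hexdigest;
  }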


> For local deduplication, you would store paths as a directory name and
> a parent-index:
>   #table=path
>   #columns=id,name,pid
>   1,"",0
>   2,"usr",1
>   3,"local",2
> And then have a list of filenames, and per filename in which path it
> exists:
>   #table=file
>   #columns=id,name
>   #table=detail
>   #columns=file_id,path_id,size,md5
> For combining index values, use something like: ( i1 << N ) | i2
> (where N is the number of bits needed by i2).

Where did you find "( i1 << N ) | i2" for MD5?
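
My reading (an assumption; the thread doesn't spell it out) is that ( i1 << N ) | i2 is for packing two small integer ids from the tables above into one key, not for MD5 values. In Perl:

  use strict;
  use warnings;

  # Pack a file_id and a path_id into a single integer key.
  my ($file_id, $path_id) = (7, 42);
  my $N = 1;
  $N++ while (1 << $N) <= $path_id;   # bits needed to hold path_id
  my $key = ($file_id << $N) | $path_id;

  # Splitting the key back into its parts:
  my $mask = (1 << $N) - 1;
  printf "key=%d -> file_id=%d path_id=%d\n",
      $key, $key >> $N, $key & $mask;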


> I would not involve string concatenation: keep things separate once
> separated. Use arrays.

I would prefer comparing two files by comparing two digests, rather than comparing two arrays of digests.
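
One way to keep that property while still hashing in pieces (a sketch, not something prescribed upthread) is to digest each chunk and then digest the ordered list of chunk digests. Joining fixed-width hex digests sidesteps the ambiguity that makes arbitrary string concatenation risky:

  use strict;
  use warnings;
  use Digest::MD5 qw(md5_hex);

  # Roll an array of per-chunk digests up into a single file digest.
  my @chunk_digests = map { md5_hex($_) } ('chunk one', 'chunk two');
  my $file_digest   = md5_hex(join '', @chunk_digests);
  print "$file_digest\n";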


> Use (parts of) md5's of strings, if you need to compare to remote
> locations.

I use all of the digest.


> So it would be best to first explain *more* about what you are trying
> to solve. A single computer or multiple, connected or not?
> Suppose one computer sends a concise email about what it has, such
> that the other computer can reply with an even more concise email
> about what it has and what it needs. IOW: diff+patch.
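
That exchange is easy to sketch once both sides have digest lists (the values here are made up): side A sends what it has, and side B replies with only what it lacks.

  use strict;
  use warnings;

  my %A_has = map { $_ => 1 } qw(d41d8cd9 0cc175b9 92eb5ffe);
  my %B_has = map { $_ => 1 } qw(d41d8cd9 92eb5ffe);

  # B's concise reply: the digests it is missing.
  my @B_needs = grep { !$B_has{$_} } sort keys %A_has;
  print "B needs: @B_needs\n";    # B needs: 0cc175b9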

I'd like the application(s) to work over SSH, similar to rsync.
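
For the transport, plain Perl can read a remote digest list over a pipe to ssh, much as rsync spawns itself on the far end. A sketch; the host and the remote command are placeholders:

  use strict;
  use warnings;

  my $host = 'example.com';
  open my $remote, '-|', 'ssh', $host,
      'find /data -type f -exec md5sum {} +'
      or die "ssh: $!";

  my %remote_digest;    # path => digest, as reported by the far side
  while (<$remote>) {
      my ($digest, $path) = split ' ', $_, 2;
      chomp $path;
      $remote_digest{$path} = $digest;
  }
  close $remote or warn "ssh exited with status $?";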


David
