Hi David,

I took another look at it, and I think I pointed you in the wrong direction. BLAST might not be what you need. Sorry about this.
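In case it helps with the digest comparison you describe below, here is a rough sketch of one way to do it. It is in Python only for brevity; the same idea should map onto Digest::MD5 in Perl. The function names (`file_md5`, `tree_digests`, `compare_trees`) and the 1 MiB chunk size are my own assumptions, not anything from your code. It reads each file in chunks, so a ~12 GB file never has to fit in memory, and it compares one full digest per file.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in fixed-size chunks.

    chunk_size (1 MiB here) is an arbitrary assumption; any positive
    size works, it only bounds memory use for very large files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file path (relative to root) to the MD5 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_md5(full)
    return digests


def compare_trees(left, right):
    """Return (only_left, only_right, differing) as sorted lists of paths."""
    a = tree_digests(left)
    b = tree_digests(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_left, only_right, differing
```

To work over SSH, each side could run `tree_digests` locally and send only the path-to-digest table across, which keeps the traffic small. This hypothetical layout does nothing clever about renamed files; it only reports paths that are missing or whose contents differ.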
Jing

On 25 Sep 2013, at 03:31, David Christensen <dpchr...@holgerdanske.com> wrote:

> On 09/24/13 00:12, Dr.Ruud wrote:
>> I assume this is about paths and filenames. Have you considered an rsync
>> dry-run?
>
> I use "rsync -n ..." frequently.
>
>
>> I also assume that you want to communicate as little as possible, so you
>> don't have supersets of all strings on all sides. (or it would become a
>> simple indexing problem)
>> I also assume that you are more interested in missing items, so
>> hash-value collisions are not a problem.
>
> My use-case is ~100k files. I'm looking for a hash function that will have
> few, if any, collisions.
>
>
>> I also assume that the set of string1 is smaller than that of string2,
>> let's say 100 vs. 10000 different values.
>
> string1 and string2 can be anywhere from the empty string to the entire
> contents of files; the largest file I have is ~12 GB.
>
>
>> For local deduplication, you would store paths as a directory name and a
>> parent-index:
>> #table=path
>> #columns=id,name,pid
>> 1,"",0
>> 2,"usr",1
>> 3."local",2
>> And then have a list of filenames, and per filename in which path it
>> exists.
>> #table=file
>> #columns=id,name
>> #table=detail
>> #columns=file_id,path_id,size,md5
>> For combining index values, use something like: ( i1 << N ) | i2.
>> (where N is the number of bits needed by i2)
>
> Where did you find "( i1 << N ) | i2" for MD5?
>
>
>> I would not involve string concatenation: keep things separate once
>> separated. Use arrays.
>
> I would prefer comparing two files by comparing two digests, rather than
> comparing two arrays of digests.
>
>
>> Use (parts of) md5's of strings, if you need to compare to remote
>> locations.
>
> I use all of the digest.
>
>
>> So best first explain *more* now about what you try to solve.
>> A single or multiple computers, connected or not?
>> Suppose 1 computer sends a concise email about what it has, such that
>> the other computer can reply with an even conciser email about what it
>> has, and what it needs. IOW: diff+patch.
>
> I'd like the application(s) to work over SSH, similar to rsync.
>
>
> David
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/