Hi David,

I took another look at it, and I think I pointed you in the wrong
direction. BLAST might not be what you need. Sorry about this.

Jing
On 25 Sep 2013, at 03:31, David Christensen <dpchr...@holgerdanske.com> wrote:

> On 09/24/13 00:12, Dr.Ruud wrote:
>> I assume this is about paths and filenames. Have you considered an rsync
>> dry-run?
> 
> I use "rsync -n ..." frequently.
> 
> 
>> I also assume that you want to communicate as little as possible, so you
>> don't have supersets of all strings on all sides. (or it would become a
>> simple indexing problem)
>> I also assume that you are more interested in missing items, so
>> hash-value collisions are not a problem.
> 
> My use-case is ~100k files.  I'm looking for a hash function that will have 
> few, if any, collisions.
> 
> 
>> I also assume that the set of string1 is smaller than that of string2,
>> let's say 100 vs. 10000 different values.
> 
> string1 and string2 can be anywhere from the empty string to the entire 
> contents of files; the largest file I have is ~12 GB.
> 
> 
>> For local deduplication, you would store paths as a directory name and a
>> parent-index:
>> #table=path
>> #columns=id,name,pid
>> 1,"",0
>> 2,"usr",1
>> 3,"local",2
>> And then have a list of filenames, and per filename in which path it
>> exists.
>> #table=file
>> #columns=id,name
>> #table=detail
>> #columns=file_id,path_id,size,md5
>> For combining index values, use something like: ( i1 << N ) | i2.
>> (where N is the number of bits needed by i2)
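
The parent-index path table quoted above can be sketched as follows. This is only an illustration in Python, not code from the thread: the names `ROWS`, `build_index` and `full_path` are hypothetical, and it assumes that a pid of 0 means "no parent", which the three sample rows suggest but do not state.

```python
# Rows of the "path" table as (id, name, pid), taken from the quoted example.
ROWS = [
    (1, "", 0),       # root directory, no parent (assumption: pid 0 = none)
    (2, "usr", 1),    # /usr
    (3, "local", 2),  # /usr/local
]


def build_index(rows):
    """Map each path id to its (name, parent id) pair."""
    return {path_id: (name, pid) for path_id, name, pid in rows}


def full_path(index, path_id):
    """Walk the parent links up to the root and join the names."""
    parts = []
    while path_id != 0:
        name, pid = index[path_id]
        parts.append(name)
        path_id = pid
    # The root has an empty name, so the join yields a leading "/".
    return "/".join(reversed(parts)) or "/"


index = build_index(ROWS)
print(full_path(index, 3))
```

Each directory name is stored once, and a file only needs to carry the id of the directory it lives in.
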
> 
> Where did you find "( i1 << N ) | i2" for MD5?
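
As quoted, the expression is introduced "for combining index values", so it reads as a way to pack two integer ids (for example a file id and a path id) into one key, not as something specific to MD5. A minimal sketch in Python, assuming non-negative integers; the names `pack` and `unpack` and the choice of 17 bits are illustrative only:

```python
def pack(i1, i2, n_bits):
    """Combine two non-negative indices into one integer: (i1 << N) | i2."""
    if i1 < 0 or i2 < 0:
        raise ValueError("indices must be non-negative")
    if i2 >= (1 << n_bits):
        raise ValueError("i2 does not fit in n_bits bits")
    return (i1 << n_bits) | i2


def unpack(key, n_bits):
    """Recover (i1, i2) from a key produced by pack()."""
    return key >> n_bits, key & ((1 << n_bits) - 1)


# 17 bits hold values up to 131071, enough for ~100k distinct ids.
key = pack(5, 3, 17)
print(key, unpack(key, 17))
```

The packing is reversible only while i2 stays below 2**N, which is why N has to be chosen from the largest value i2 can take.
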
> 
> 
>> I would not involve string concatenation: keep things separate once
>> separated. Use arrays.
> 
> I would prefer comparing two files by comparing two digests, rather than 
> comparing two arrays of digests.
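
Comparing two files by one digest each could look like the sketch below. It is written in Python for illustration and is not the application discussed in the thread; the helper names `digest_of_file` and `same_content` are hypothetical. The file is read in chunks so that a file of several GB never has to fit in memory, and MD5 is the default only because the thread mentions it (any algorithm name accepted by `hashlib.new`, such as "sha256", can be passed instead).

```python
import hashlib


def digest_of_file(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b, algorithm="md5"):
    """Compare two files by comparing one digest per file."""
    return digest_of_file(path_a, algorithm) == digest_of_file(path_b, algorithm)
```

Equal digests only indicate equal content up to the collision resistance of the chosen algorithm; different digests prove the contents differ.
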
> 
> 
>> Use (parts of) md5's of strings, if you need to compare to remote
>> locations.
> 
> I use all of the digest.
> 
> 
>> So it would be best to first explain *more* about what you are trying to solve.
>> A single or multiple computers, connected or not?
>> Suppose 1 computer sends a concise email about what it has, such that
>> the other computer can reply with an even conciser email about what it
>> has, and what it needs. IOW: diff+patch.
> 
> I'd like the application(s) to work over SSH, similar to rsync.
> 
> 
> David
> 
> -- 
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
> 
> 