Hi John,

Yeah, I'm going for near-duplicate detection. Thanks for your advice. I'll look into those algorithms and give it a go!

Cheers,
-dcf

On Tue, Apr 16, 2013 at 3:52 AM, John Conwell <[email protected]> wrote:

> Depends on what kind of deduplication you're trying to do. Do you want exact
> dup detection? Or near dup detection?
>
> For exact dup you don't need Mahout. Just run each doc through a mapper,
> where the mapper does an MD5 hash on the doc, and emit the MD5 hash value as
> the mapper key, and the doc id as the mapper value. Then the reducer will
> pull all the documents together that have the same MD5 hash value.
>
> If you want to do a near dup analysis, you can go with an ngram shingling
> analysis. I don't think there is anything built into Mahout that does this,
> but you can use Mahout's ngram generation, and specify a very low Log
> Likelihood score so most/all of the ngrams get emitted. Then use this
> ngram data in your shingling algorithm. There are several known shingling
> algorithms out there, just google them, and implement.
>
> On Mon, Apr 15, 2013 at 6:38 AM, xdcfff <[email protected]> wrote:
>
> > Hi all,
> >
> > Just looking for some general guidance on how I would approach this task.
> >
> > If I have two datasets containing items, what is currently the best way to
> > detect duplicates between them using Mahout? I intend on matching based on
> > item name text similarity to begin with.
> >
> > I'm willing to write Java wherever necessary, but I just want to be sure to
> > avoid "re-coding the wheel" as such.
> >
> > Cheers,
> > -dcf
>
> --
> Thanks,
> John C
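
For reference, the exact-duplicate approach John describes (hash each doc in a mapper, group by hash in a reducer) might look roughly like the sketch below. It is plain in-memory Python, not Hadoop or Mahout code. The names `map_doc` and `find_exact_duplicates` and the sample items are made up for illustration.

```python
import hashlib
from collections import defaultdict


def map_doc(doc_id, text):
    """Mapper step: emit (MD5 hex digest of the text, doc id)."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest, doc_id


def find_exact_duplicates(docs):
    """Reducer step: collect doc ids under their hash key and
    return only the groups that contain more than one doc."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        key, value = map_doc(doc_id, text)
        groups[key].append(value)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical sample data: ids and names are invented.
docs = {"a1": "Blue widget", "b7": "Blue widget", "b9": "Red widget"}
print(find_exact_duplicates(docs))
```

Note that this only catches byte-identical text. Any difference in case or whitespace gives a different hash, which is why it does not help with near duplicates.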

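
For the near-duplicate case, one common form of shingling is to turn each item name into a set of n-grams and compare the sets with Jaccard similarity. The sketch below assumes character trigrams (rather than the word ngrams Mahout would emit), a brute-force comparison of every pair, and an arbitrary threshold of 0.5. All function names and sample items are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of character n-grams of the text after
    lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(left, right, threshold=0.5):
    """Compare every name in `left` with every name in `right`
    and return (left id, right id, score) for pairs at or above
    the threshold. This is quadratic in the number of items."""
    pairs = []
    for l_id, l_name in left.items():
        l_shingles = shingles(l_name)
        for r_id, r_name in right.items():
            score = jaccard(l_shingles, shingles(r_name))
            if score >= threshold:
                pairs.append((l_id, r_id, round(score, 2)))
    return pairs


# Hypothetical sample data from two datasets.
dataset_a = {"a": "Apple iPhone 5 16GB"}
dataset_b = {"b": "apple iphone 5  16gb", "c": "Samsung Galaxy S4"}
print(near_duplicates(dataset_a, dataset_b))
```

The all-pairs loop is fine for small datasets. For large ones, a technique such as MinHash is usually used to avoid comparing every pair.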