Re: Using Lucene for Moderate Similarity Check..

Grant Ingersoll Tue, 09 Jun 2009 07:34:33 -0700

Hi Ravi,

Lucene can enable this, but you will have some work to do on top of it. If you search the archives for record linkage (http://www.lucidimagination.com/search/?q=record+linkage) you will find a fair amount of discussion on this. Also, in somewhat shameless marketing mode, my co-author, Tom Morton, is just putting the finishing touches on a chapter in our book called Taming Text (http://www.manning.com/ingersoll) which discusses some of the techniques involved in making this stuff happen. That chapter should be released in the next few weeks.

You likely can get a basic system working pretty quickly with what is in Lucene, but then the next level is often more difficult. You often end up with a rules system that can become brittle with this approach. An alternative is to apply some type of machine learning approach. You could also look at this as a clustering problem, which Mahout (or other clustering tools) could be helpful in solving.
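
As a rough illustration of the kind of rules system described above, here is a minimal sketch in plain Python (it is not Lucene's API). It assumes each record is a dict with the hypothetical keys name, address and pin, uses difflib.SequenceMatcher as a stand-in similarity measure, and the thresholds are made up for illustration. Note that ABDUL vs ABDULLAH scores about 0.77 with this measure, which is below the 80% mentioned in the original question; that is one small example of how hand-tuned thresholds become brittle.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Hypothetical rule cascade: (field name, minimum similarity), checked in order.
# The field names and threshold values are assumptions, not from the thread.
RULES = [("name", 0.75), ("address", 0.80), ("pin", 1.0)]


def is_match(rec_a: dict, rec_b: dict, rules=RULES) -> bool:
    """Compare attribute by attribute; stop at the first field that fails."""
    for field, threshold in rules:
        if similarity(rec_a[field], rec_b[field]) < threshold:
            return False
    return True


a = {"name": "ABDUL", "address": "12 Hill Road", "pin": "400001"}
b = {"name": "ABDULLAH", "address": "12 Hill Rd", "pin": "400001"}

print(round(similarity("ABDUL", "ABDULLAH"), 3))  # 0.769
print(is_match(a, b))
```

Comparing every record in one file against every record in the other grows quickly with file size, so in practice you would probably use an index to fetch a handful of candidates per record and only score those.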

Finally, just know there will be a human in the loop with any approach. The goal is to minimize the number of matches that a person has to check.
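
One common way to keep the human workload small is to use two cut-offs instead of one: accept high scores automatically, reject low scores automatically, and send only the middle band to a person. The sketch below assumes you already have a 0..1 score per candidate pair; the function names and the cut-off values 0.90 and 0.60 are hypothetical and would need tuning on real data.

```python
from collections import Counter


def triage(score: float, accept: float = 0.90, reject: float = 0.60) -> str:
    """Bucket a pair score into 'match', 'review' or 'no-match'."""
    if score >= accept:
        return "match"
    if score < reject:
        return "no-match"
    return "review"  # ambiguous pair: needs a human decision


def review_load(scores, accept: float = 0.90, reject: float = 0.60) -> Counter:
    """Count how many pairs land in each bucket for the given cut-offs."""
    return Counter(triage(s, accept, reject) for s in scores)


# Made-up scores, purely for illustration.
scores = [0.95, 0.75, 0.30, 0.61]
print(review_load(scores))
```

Widening the gap between the two cut-offs lowers the risk of wrong automatic decisions but sends more pairs to review, so the counts give a quick view of that trade-off.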


Hope this helps,
Grant

On Jun 9, 2009, at 4:16 AM, RaviK Thakur wrote:

Hello All,
I want to check the feasibility of using Lucene for a similarity check
between two flat CSV files. The actual requirement is like this: We
have two files, each containing customer information such as name,
address, pin code etc. Some customers may be common to both files. We
want to find the customers that are common to these files. But the
match should be on an attribute basis. If the name of the customer in
one file matches the name of the customer in the other file, then match
the address; if it matches, then match the pin code, and so on. But the
main consideration is that this matching is not exact. If the name of
the customer matches, say, 80%, then it may be termed a match. For
example, if ABDUL is matched with ABDULLAH, it should be termed a match.
In this fashion each record of one file will be matched with each record
of the other file. The output of this procedure will be another file
containing the matched records.
Can anyone please suggest the applicability of Lucene for this requirement?
Maybe in the form of pros and cons.

Thanks in advance:-)
Ravi


______________________________________________________________________

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search


