Hi Grant,
Thanks for the reply.
I will definitely look into the Solr Deduplication approach. But since I am
using pure Lucene and not Solr, I am not sure how feasible it would be to
find an equivalent in Lucene or to replicate it myself. Still, that looks to
be the way forward.
Also regarding the question a
I'd probably treat this as a deduplication problem and look to use a fuzzy
matching approach, such as the TextProfileSignature in Solr/Nutch:
http://wiki.apache.org/solr/Deduplication, which I believe is tunable as to
its threshold of acceptance.
I'd also likely give pushback on the notion of
Can someone please help with the logic that can be applied to decide on the
closeness requirement given below (like 50% matching)? This matching is
pure text matching.
Since the current Lucene score does not translate into a percentage of
closeness, is there anything else that can give this info?
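
One possible way to get a percentage out of pure text matching, sketched
below, is to compare the two texts as sets of words and report the Jaccard
overlap (shared words divided by all distinct words). This is only an
illustration under stated assumptions: it is not how Lucene scores documents
and it is not Solr's TextProfileSignature, and the names Closeness, tokens
and percentMatch are made up for this example. It uses only the Java
standard library.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
class Closeness {

    // Lower-case the text and split on runs of non-word characters.
    // Assumption: plain ASCII-style text; a real setup would more likely
    // reuse the same analyzer that was used at index time.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity as a percentage: |A and B| / |A or B| * 100.
    // Two empty texts are treated as identical (100%).
    static double percentMatch(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 100.0;
        }
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        Set<String> all = new HashSet<>(ta);
        all.addAll(tb);
        return 100.0 * shared.size() / all.size();
    }
}
```

For example, Closeness.percentMatch("the quick brown fox", "the quick red fox")
shares 3 of 5 distinct words, so it comes out at 60%, which would pass a
50% closeness requirement. Word order and repeated words are ignored here;
comparing word n-grams instead of single words would be stricter.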