i have a thesis work which i have done. it was on lega documents. the XML IR systems are very susceptible for producing duplicate or near duplicate contents (not in concept, but in textual content ). here is what i did . i tag each article content in the legal documents, with their status, and their relationship with other article contents. and then write a parer that will read this tags and index the contents therein. and write a re-ranking algorithm that works based on the staus information of article contents and their relationship. for example the one that contians an active law will be boosted, because it is the active law that prevails in matters. and some times article are replaced by other articles , in this case rather than presenting both them (which can result in duplicates ) i compare new terms used in the process of replacing with the query terms and boost the replacing article content. or i compare the terms that are exclusively used by the old article content with the query terms and boost the old article content. the repealed article contents are downwwited by some factor rather than presenting their them along with their other version . this is what i have done to reduce the number of duplicating or near duplicating search results to users.other wise the user can waste considerable time inspecting which is duplicate and which is not . i can give u the the abstract and come codes (in java) id u want i think this might help henok
--- On Wed, 9/16/09, syedfa <fayyazud...@gmail.com> wrote: From: syedfa <fayyazud...@gmail.com> Subject: Finding duplicate records from a result set To: java-user@lucene.apache.org Date: Wednesday, September 16, 2009, 1:52 AM Dear Fellow Java/Lucene developers: One annoyance that people have when searching for information online is the occurance of duplicate records (i.e. multiple sites that carry news feeds from the SAME news source like reuters or the associated press, and do not provide any additional pieces of information). This becomes an issue for the user, as they would like to sift through all the duplicates and only search through only the unique hits. In my application that I am working on, I realize that this is extremely common. I have various books in xml format that contain quotations, of which many are also listed in other collections (i.e. the narrator, and the text of the quotation are almost exactly the same. Because the books have been translated into english by different authors, the quotations from each collection differ slightly from one another. The quotations are being reported by multiple sources). What I would like to do in my application, using either Lucene, is to return a set of results, such that if a user searches for a particular keyword (or uses multiple keywords), then the result set should list any quote that is reported from multiple sources only once, and underneath that quote, simply list all the references from the other collections where it is found, instead of listing the exact same quote in the result set, multiple times. For example, if John Doe said, "blah blah blah", which is found in the sources A, B, and C, if a user searched for "blah blah blah", then I want the result set to show: 1. Narrator: John Doe Quote: "blah blah blah" Reference: A, B, C and NOT like the following: 1. Narrator: John Doe Quote: "blah blah blah" Reference: A 2. Narrator: John Doe Quote: "blah blah blah" Reference: B 3. Narrator: John Doe Quote: "blah blah blah" Reference: C I would imagine that this is a known issue in information retrieval, and I am wondering if you have been able to solve/address this issue in Java using Lucene? What would you advise? Thanks to everyone for your time and patience. -- View this message in context: http://www.nabble.com/Finding-duplicate-records-from-a-result-set-tp25468423p25468423.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org