I did my thesis work on legal documents. XML IR systems are very susceptible
to producing duplicate or near-duplicate content (not in concept, but in
textual content).
Here is what I did.
I tagged each article content in the legal documents with its status and its
relationship to other article contents.
I then wrote a parser that reads these tags and indexes the contents
therein.
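
As a rough illustration of the tagging and parsing step described above, here
is a minimal sketch. The element and attribute names (`article`, `id`,
`status`, `replaces`) and the class `ArticleTagParser` are assumptions made up
for this example; the actual tag set used in the thesis is not shown in this
message. The sketch only extracts the tagged fields and leaves the indexing
call out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ArticleTagParser {

    /** One tagged article content. All field names are hypothetical. */
    static final class Article {
        final String id;
        final String status;   // assumed values: "active" or "repealed"
        final String replaces; // id of the article this one replaces, "" if none
        final String text;

        Article(String id, String status, String replaces, String text) {
            this.id = id;
            this.status = status;
            this.replaces = replaces;
            this.text = text;
        }
    }

    /** Reads every <article> element and copies its tags and text content. */
    static List<Article> parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("article");
        List<Article> articles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            // getAttribute returns "" when the attribute is absent.
            articles.add(new Article(
                    e.getAttribute("id"),
                    e.getAttribute("status"),
                    e.getAttribute("replaces"),
                    e.getTextContent().trim()));
        }
        return articles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<law>"
                + "<article id=\"a1\" status=\"repealed\">Old wording.</article>"
                + "<article id=\"a2\" status=\"active\" replaces=\"a1\">"
                + "New wording.</article>"
                + "</law>";
        for (Article a : parse(xml)) {
            System.out.println(a.id + " [" + a.status + "] replaces='"
                    + a.replaces + "': " + a.text);
        }
    }
}
```

Each parsed `Article` would then be turned into one indexed document, with the
status and relationship stored as fields next to the text so they are
available at ranking time.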
I also wrote a re-ranking algorithm that works on the status information of
the article contents and their relationships.
For example:
An article content that contains an active law is boosted, because it is the
active law that prevails in matters.
Sometimes articles are replaced by other articles. In this case, rather
than presenting both of them (which can result in duplicates), I compare the
new terms introduced by the replacement with the query terms and boost the
replacing article content, or I compare the terms used exclusively by the old
article content with the query terms and boost the old article content.
The repealed article contents are down-weighted by some factor rather than
presented along with their other versions.
This is what I did to reduce the number of duplicate or near-duplicate
search results shown to users. Otherwise the user can waste considerable
time inspecting which results are duplicates and which are not.
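
The re-ranking rules described above could be sketched roughly as follows.
This is not the thesis code: the class `StatusReRanker`, the `Hit` fields, the
status strings and the three numeric factors are all assumptions chosen for
illustration, and the term comparison is a simple lowercase word-set
difference. The sketch only adjusts scores and re-sorts; it does not drop any
hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StatusReRanker {
    // Assumed factors; the original message only says "some factor".
    static final double ACTIVE_BOOST = 1.5;
    static final double REPEALED_PENALTY = 0.5;
    static final double TERM_MATCH_BOOST = 2.0;

    /** A search hit carrying the tagged status and relationship. */
    static final class Hit {
        final String id;
        final String status;   // assumed values: "active" or "repealed"
        final String replaces; // id of the replaced article, "" if none
        final String text;
        double score;          // initial retrieval score, adjusted in place

        Hit(String id, String status, String replaces, String text, double score) {
            this.id = id;
            this.status = status;
            this.replaces = replaces;
            this.text = text;
            this.score = score;
        }
    }

    /** Lowercased set of word tokens in the text. */
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Adjusts the scores of the given hits and returns them sorted, best first. */
    static List<Hit> reRank(List<Hit> hits, String query) {
        Set<String> queryTerms = terms(query);
        Map<String, Hit> byId = new HashMap<>();
        for (Hit h : hits) byId.put(h.id, h);

        // Rule 1: boost active law, down-weight repealed law.
        for (Hit h : hits) {
            if ("active".equals(h.status)) h.score *= ACTIVE_BOOST;
            else if ("repealed".equals(h.status)) h.score *= REPEALED_PENALTY;
        }

        // Rule 2: for each (old, new) pair in the result set, boost the
        // version whose exclusive terms overlap with the query terms.
        for (Hit newer : hits) {
            Hit older = byId.get(newer.replaces);
            if (older == null) continue;
            Set<String> onlyNew = terms(newer.text);
            onlyNew.removeAll(terms(older.text));
            Set<String> onlyOld = terms(older.text);
            onlyOld.removeAll(terms(newer.text));
            if (!Collections.disjoint(onlyNew, queryTerms)) newer.score *= TERM_MATCH_BOOST;
            if (!Collections.disjoint(onlyOld, queryTerms)) older.score *= TERM_MATCH_BOOST;
        }

        List<Hit> ranked = new ArrayList<>(hits);
        ranked.sort((a, b) -> Double.compare(b.score, a.score));
        return ranked;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a1", "repealed", "", "tax rate ten percent", 1.0));
        hits.add(new Hit("a2", "active", "a1", "tax rate fifteen percent", 1.0));
        for (Hit h : reRank(hits, "tax fifteen")) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

Whether the lower-scored version of a pair is hidden entirely or just pushed
down the list is a separate presentation choice that this sketch leaves open.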
I can give you the abstract and some code (in Java) if you want.
I think this might help.
Henok

--- On Wed, 9/16/09, syedfa <fayyazud...@gmail.com> wrote:

From: syedfa <fayyazud...@gmail.com>
Subject: Finding duplicate records from a result set
To: java-user@lucene.apache.org
Date: Wednesday, September 16, 2009, 1:52 AM


Dear Fellow Java/Lucene developers:

One annoyance that people have when searching for information online is the
occurrence of duplicate records (i.e. multiple sites that carry news feeds
from the SAME news source like Reuters or the Associated Press, and do not
provide any additional pieces of information).  This becomes an issue for
the user, as they would like to filter out all the duplicates and search
through only the unique hits.  In the application that I am working on, I
realize that this is extremely common.  I have various books in XML
format that contain quotations, many of which are also listed in other
collections (i.e. the narrator and the text of the quotation are almost
exactly the same.  Because the books have been translated into English by
different authors, the quotations from each collection differ slightly from
one another.  The quotations are being reported by multiple sources).  What
I would like to do in my application, using Lucene, is to return a
set of results such that if a user searches for a particular keyword (or
uses multiple keywords), then the result set lists any quote that is
reported from multiple sources only once, and underneath that quote simply
lists all the references from the other collections where it is found,
instead of listing the exact same quote in the result set multiple times.
 
For example, if John Doe said, "blah blah blah", which is found in the
sources A, B, and C, if a user searched for "blah blah blah", then I want
the result set to show:
 
1. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: A, B, C
 
and NOT like the following:
 
1. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: A
 
2. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: B
 
3. Narrator: John Doe
   Quote: "blah blah blah"
   Reference: C
 
I would imagine that this is a known issue in information retrieval, and I
am wondering if you have been able to solve/address this issue in Java using
Lucene?  What would you advise?  
 
Thanks to everyone for your time and patience.

