near duplicates
How can I eliminate near duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. One major problem with this is that the structure of the document is no longer taken into account. Are there any obvious pitfalls? For example: Document A being a subset of Document B but in no particular order. Nutch's DeleteDuplicates class is useful only when the documents are identical with respect to either URL or content.
Re: near duplicates
It doesn't make sense to eliminate near duplicates at search time. But if you are trying to cluster duplicates together, then you probably want to look at Carrot.

On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote:

Hi Andrej!

I'm taking a look at fuzzy signatures for near-duplicate detection and I have seen your TextProfileSignature. The question is: if I index the documents with their text signature, is there a way to filter near duplicates at search time without comparing each document with all the others?

Thanks
Beto

Andrzej Bialecki wrote:
> karl wettin wrote:
>>
>> 17 Oct 2006 at 17:54, Find Me wrote:
>>
>>> How to eliminate near duplicates from the index?
>>
>> I would probably try to measure the Euclidean distance between all
>> documents, computed on terms and their positions. Or perhaps use
>> standard deviation to find the distribution of terms in a document.
>> Based on the output from that, one would try to find a threshold.
>> Either way it will consume lots of CPU.
>
> There are better ways to achieve this. You need to create a fuzzy
> signature of the document, based on a term histogram or shingles. Take a
> look at the Signature framework in Nutch.
>
> There is a substantial literature on this subject. Go to Citeseer and
> run a search for "near duplicate detection".

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
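To make the shingle idea from this thread concrete, here is a minimal sketch in plain Java. It is not Nutch's Signature framework or TextProfileSignature; the class name `ShingleSimilarity`, the whitespace tokenization, and the choice of Jaccard overlap as the similarity measure are all assumptions made for illustration. Word shingles keep local word order, which is the information a plain term-vector comparison loses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Nutch.
final class ShingleSimilarity {

    // Builds the set of k-word shingles (overlapping word windows) of a text.
    static Set<String> shingles(String text, int k) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in the range [0, 1].
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty documents are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Two documents would then be flagged as near duplicates when the similarity exceeds a threshold that has to be tuned on real data. Note that this sketch still compares documents pairwise, so it does not by itself avoid the all-pairs cost raised above; a compact signature computed once per document at indexing time is what avoids that.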
Re: Lucene id generation
On 12/11/06, Waheed Mohammed <[EMAIL PROTECTED]> wrote:

> Hello,
> Is there a way to influence Lucene's generation of ids while indexing?
> My requirement is this: I want to have different indexes where no index
> has ids that have been assigned to an earlier index. For instance:
>
>   IDX1: {0...100}  IDX2: {101...200}  IDX3: {201...300}
>
> but not:
>
>   IDX1: {0...100}  IDX2: {0...100}  IDX3: {0...100}

I don't think you should be doing that. If you want the same effect, you can package hits from the different indices at search time with a predetermined offset for each index. For example, IDX1 will have an offset of 0, IDX2 an offset of 101, and so on.

--Rajesh Munavalli
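The offset scheme suggested in this reply can be sketched in plain Java as follows. This is only an illustration of the arithmetic, with no Lucene calls; the class name `DocIdOffsets` and its methods are hypothetical, and it assumes the number of ids in each index is known up front.

```java
// Hypothetical helper, not part of Lucene.
final class DocIdOffsets {

    // starts[i] is the first global id of index i; starts[n] is the total count.
    private final int[] starts;

    DocIdOffsets(int[] indexSizes) {
        starts = new int[indexSizes.length + 1];
        for (int i = 0; i < indexSizes.length; i++) {
            starts[i + 1] = starts[i] + indexSizes[i];
        }
    }

    // Maps an id that is local to one index onto the combined id space.
    int toGlobal(int index, int localId) {
        if (localId < 0 || localId >= starts[index + 1] - starts[index]) {
            throw new IllegalArgumentException("local id out of range: " + localId);
        }
        return starts[index] + localId;
    }

    // Finds which index a global id belongs to.
    int indexOf(int globalId) {
        for (int i = 0; i + 1 < starts.length; i++) {
            if (globalId >= starts[i] && globalId < starts[i + 1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("global id out of range: " + globalId);
    }

    // Recovers the id local to the owning index.
    int toLocal(int globalId) {
        return globalId - starts[indexOf(globalId)];
    }
}
```

With sizes of 101, 100 and 100, the offsets come out as 0, 101 and 201, matching the ranges in the question. The indexes themselves are left untouched; the mapping only exists at search time.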
Re: Speed of grouped queries
On 1/2/07, sdeck <[EMAIL PROTECTED]> wrote:

> Thanks in advance for any insight on this one.
> I have a fairly large query to run, and it takes roughly 20-40 seconds
> to complete the way that I have it. Here is the best example I can give.
> I have a set of roughly 25K documents indexed.
> I have queries that get documents matching a particular actor.
> Then, I have a movie query that takes all of the documents found for
> each actor query and combines them all together to say: here are all
> documents that are relevant for this movie.
> Then, and here is the time hog, I have a genre query that says: take all
> movies, get their results, and combine them together into this genre
> result set.

Is there any possibility of using Carrot clustering for genre? Could you please give examples of the final complex query as well as the individual simple queries? You could also state the aim of the query. Are you trying to get a clustered list of movies (based on genre) for a particular actor?

--Rajesh Munavalli

> The problem is, at indexing time, I do not have a way to say if a
> document is a particular genre, or a particular actor, or movie, etc.
> If I try, for the genre query, to get all documents and then filter by
> the queries for movies and actors, I get heap space memory issues.
> The query for collecting a specific actor takes around 200-300
> milliseconds, and the movie one, which actually queries each actor,
> takes roughly 500-700 milliseconds.
> Yet, for a genre, where you may have 50-100 movies, it takes 500
> milliseconds times the number of movies.
>
> Any ideas on how I could run these queries differently? For a given
> actor query, there are about 5-7 boolean query clauses, just to give
> some insight.
> I currently just create 1 HitSetCollector (I rolled my own
> bitsetcollector) and just run searches with it. I just get crapped on
> when it does that genre search. I wish there was an easier way to
> aggregate all of those documents together from all of those searches.
> After it is done, I cache the results, but the initial hit is bad.
> Any help would be much appreciated.
> Sdeck
> --
> View this message in context: http://www.nabble.com/Speed-of-grouped-queries-tf2910499.html#a8132099
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
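One way to picture the aggregation described in this message is as a union of per-actor bit sets, with each actor query run at most once and reused by every movie that needs it. The sketch below is plain modern Java with no Lucene calls. The class `GroupedHits`, the `actorSearch` function (standing in for whatever runs a single actor query and collects matching document numbers into a `BitSet`), and the assumption that movies share actors are all hypothetical, so whether this helps depends on how much the actor lists overlap.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Lucene.
final class GroupedHits {

    // Assumed hook: runs one actor query and returns the matching doc numbers.
    private final Function<String, BitSet> actorSearch;
    private final Map<String, BitSet> actorCache = new HashMap<>();
    int searchesRun = 0; // counts real searches, to show the effect of caching

    GroupedHits(Function<String, BitSet> actorSearch) {
        this.actorSearch = actorSearch;
    }

    // Returns the cached result for an actor, searching only on first use.
    BitSet actor(String name) {
        BitSet cached = actorCache.get(name);
        if (cached == null) {
            cached = actorSearch.apply(name);
            searchesRun++;
            actorCache.put(name, cached);
        }
        return cached;
    }

    // A movie's documents are the union of its actors' documents.
    BitSet movie(List<String> actors) {
        BitSet result = new BitSet();
        for (String name : actors) {
            result.or(actor(name)); // or() changes result, not the cached set
        }
        return result;
    }

    // A genre's documents are the union of its movies' documents.
    BitSet genre(List<List<String>> movieCasts) {
        BitSet result = new BitSet();
        for (List<String> cast : movieCasts) {
            result.or(movie(cast));
        }
        return result;
    }
}
```

For an index of about 25K documents, one bit set is roughly 3 KB, so keeping one per actor is cheap compared with holding full result lists in memory.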
DisjunctionMaxQuery explanation
I was trying to print out the score explanation for a DisjunctionMaxQuery. Though there is a hit score > 0 for each result, there is no detailed explanation. Am I doing something wrong? In the following output, each hit has two lines. The first line is the hit score and the second line is the explanation given by the DisjunctionMaxQuery.

Hit 1: 0.6027994
0.0 = max plus 0.1 times others of:

Hit 2: 0.59990174
0.0 = max plus 0.1 times others of:

Hit 3: 0.41993123
0.0 = max plus 0.1 times others of:
Re: DisjunctionMaxQuery explanation
    public void explainSearchScore(String indexLocation, DisjunctionMaxQuery disjunctQuery){
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexLocation));
        Hits hits = searcher.search(disjunctQuery);
        if(hits == null) return;
        for(int i = 0; i < hits.length(); i++){
            System.out.println("Hit " + i + " " + searcher.explain(disjunctQuery, i).toString());
        }
    }

On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: In the following output, each hit has two lines. The first line is the hit
: score and the second line is the explanation given by the
: DisjunctionMaxQuery.

How are you printing the Explanation? Are you using toString()? Can you post a small self-contained code example showing how you got this output?

: Hit 1: 0.6027994
: 0.0 = max plus 0.1 times others of:
:
: Hit 2: 0.59990174
: 0.0 = max plus 0.1 times others of:
:
: Hit 3: 0.41993123
: 0.0 = max plus 0.1 times others of:

-Hoss

--
--Rajesh Munavalli
Blog: http://munavalli.blogspot.com
Re: DisjunctionMaxQuery explanation
Forgot to add the hits.score() call to print out the hit's score.

    public void explainSearchScore(String indexLocation, DisjunctionMaxQuery disjunctQuery){
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexLocation));
        Hits hits = searcher.search(disjunctQuery);
        if(hits == null) return;
        for(int i = 0; i < hits.length(); i++){
            System.out.println("Hit " + i + ": " + hits.score(i) + "\n" + searcher.explain(disjunctQuery, i).toString());
        }
    }
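One possible cause of the empty explanations, not confirmed anywhere in this thread, is that the loop counter `i` is a rank in the result list while `searcher.explain(query, doc)` expects a document number. In the Lucene API of that era, `hits.id(i)` returns the document number of the i-th hit, so `searcher.explain(disjunctQuery, hits.id(i))` would be the call to try. The toy model below uses no Lucene classes; `ExplainByDocId` and its methods are hypothetical stand-ins that only show how mixing up rank and document number produces a 0.0 explanation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a searcher plus its ranked hits; not a Lucene class.
final class ExplainByDocId {

    // Document number -> score, for documents that match the query.
    private final Map<Integer, Float> matching = new HashMap<>();
    // Position n holds the document number of the n-th best hit.
    private final int[] rankedDocIds;

    ExplainByDocId(int[] rankedDocIds, float[] scores) {
        this.rankedDocIds = rankedDocIds.clone();
        for (int i = 0; i < rankedDocIds.length; i++) {
            matching.put(rankedDocIds[i], scores[i]);
        }
    }

    // Plays the role of hits.id(rank): rank in, document number out.
    int id(int rank) {
        return rankedDocIds[rank];
    }

    // Plays the role of searcher.explain(query, doc): keyed by document number.
    String explain(int docId) {
        Float score = matching.get(docId);
        if (score == null) {
            return "0.0 = (doc " + docId + " does not match)";
        }
        return score + " = score of doc " + docId;
    }
}
```

If the best hit is document 42, then `explain(0)` describes document 0, which may not match the query at all, while `explain(id(0))` describes the hit that was actually ranked first.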
Re: BooleanQuery
For:

    BooleanQuery bQuery = new BooleanQuery();
    bQuery.add(messageQuery, true, false);

Use:

    BooleanQuery bQuery = new BooleanQuery();
    bQuery.add(messageQuery, BooleanClause.Occur.MUST);

The mapping is as follows:

    add(query, true, false)   ->  add(query, BooleanClause.Occur.MUST)
    add(query, false, false)  ->  add(query, BooleanClause.Occur.SHOULD)
    add(query, false, true)   ->  add(query, BooleanClause.Occur.MUST_NOT)

--Rajesh Munavalli

On 9/29/06, Ismail Siddiqui <[EMAIL PROTECTED]> wrote:

> Hi,
> I have two phrase queries:
>
>     messageQuery = new PhraseQuery();
>     titleQuery = new PhraseQuery();
>     messageQuery.setSlop(3);
>     titleQuery.setSlop(1);
>     for (int i=0; i