near duplicates

2006-10-17 Thread Find Me

How can I eliminate near duplicates from the index? Someone suggested that I
could look at the TermVectors and do a comparison to remove the duplicates.
One major problem with this is that the structure of the document is no
longer taken into account. Are there any obvious pitfalls? For example:
Document A being a subset of Document B, but in no particular order.

Nutch's DeleteDuplicates class is useful only when the documents are
identical with respect to either URL or the content.


Re: near duplicates

2006-10-24 Thread Find Me

It doesn't make sense to eliminate near duplicates during search time. But
if you are trying to cluster duplicates together then probably you want to
look at Carrot.

On 10/24/06, Beto Siless <[EMAIL PROTECTED]> wrote:


Hi Andrej!

I'm taking a look at fuzzy signatures for near duplicate detection and
I have seen your TextProfileSignature. The question is: if I index
the documents with their text signature, is there a way to filter near
duplicates at search time without comparing each document with all the others?

Thanks
Beto

Andrzej Bialecki wrote:
> karl wettin wrote:
>>
>> On 17 Oct 2006 at 17:54, Find Me wrote:
>>
>>> How to eliminate near duplicates from the index?
>>
>> I would probably try to measure the Euclidean distance between all
>> documents, computed on terms and their positions. Or perhaps use
>> standard deviation to find the distribution of terms in a document.
>> Based on the output from that, one would try to find a threshold.
>> Either way it will consume lots of CPU.
>
>
> There are better ways to achieve this. You need to create a fuzzy
> signature of the document, based on a term histogram or shingles - take a
> look at the Signature framework in Nutch.
>
> There is substantial literature on this subject - go to Citeseer and
> run a search for "near duplicate detection".
>
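
As a rough illustration of the term-histogram idea mentioned above, here is a minimal, self-contained Java sketch. It is not Nutch's TextProfileSignature; the class name FuzzySignature, the quantum parameter and the tokenization rule are assumptions made up for this example. Term counts are rounded down to a multiple of quantum, rare terms drop out, and the remaining sorted profile is hashed, so small edits and reordering tend to leave the signature unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class FuzzySignature {

    /**
     * Hashes a coarse term histogram of the text. Counts are rounded down to a
     * multiple of quantum (an assumed tuning knob); terms that round to 0 are dropped.
     */
    public static String signature(String text, int quantum) {
        if (quantum < 1) {
            throw new IllegalArgumentException("quantum must be >= 1");
        }
        // TreeMap keeps terms sorted, so word order in the text does not matter.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int rounded = (e.getValue() / quantum) * quantum;
            if (rounded > 0) {
                profile.append(e.getKey()).append(':').append(rounded).append('\n');
            }
        }
        return md5Hex(profile.toString());
    }

    private static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(ex);
        }
    }
}
```

Because near duplicates are meant to share the same signature string, they can be grouped with a hash lookup on that string instead of comparing every pair of documents. The trade-off is the choice of quantum: too small and minor edits change the signature, too large and short unrelated documents may collapse together.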

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Lucene id generation

2006-12-11 Thread Find Me

On 12/11/06, Waheed Mohammed <[EMAIL PROTECTED]> wrote:


Hello,

Is there a way to influence Lucene's generation of ids while indexing?

My requirement is this: I want to have different indexes where no index
has ids that have already been assigned to an earlier index.
For instance
IDX1: {0...100}
IDX2: {101...200}
IDX3: {201...300}
but not
IDX1: {0...100}
IDX2: {0...100}
IDX3: {0...100}



I don't think you should be doing that. If you want the same effect,
you can package hits from the different indices at search time with a
predetermined offset for each index. For example, IDX1 will have an offset
of 0, IDX2 an offset of 101, and so on.
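
A minimal sketch of the offset idea in plain Java, without Lucene: each index keeps its own local ids starting at 0, and a fixed per-index base is added when hits are packaged. The class OffsetIds and its method names are hypothetical, chosen only for this illustration; the bases are computed from the index sizes, so sizes of 101, 100 and 100 give the offsets 0, 101 and 201 from the example above.

```java
public class OffsetIds {
    // starts[i] is the first global id of index i; the last entry is the total size.
    private final int[] starts;

    public OffsetIds(int... indexSizes) {
        starts = new int[indexSizes.length + 1];
        for (int i = 0; i < indexSizes.length; i++) {
            starts[i + 1] = starts[i] + indexSizes[i];
        }
    }

    /** Global id for the document that has id localId inside index indexNo. */
    public int toGlobal(int indexNo, int localId) {
        if (localId < 0 || localId >= starts[indexNo + 1] - starts[indexNo]) {
            throw new IllegalArgumentException("local id out of range: " + localId);
        }
        return starts[indexNo] + localId;
    }

    /** Number of the index that owns the given global id. */
    public int indexOf(int globalId) {
        if (globalId < 0 || globalId >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("global id out of range: " + globalId);
        }
        int i = 0;
        while (globalId >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    /** Local id inside the owning index. */
    public int toLocal(int globalId) {
        return globalId - starts[indexOf(globalId)];
    }
}
```

With new OffsetIds(101, 100, 100), the first document of the second index maps to global id 101, and global id 200 maps back to local id 99 in that same index. The indexes themselves never need to know about the offsets.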

--Rajesh Munavalli


Re: Speed of grouped queries

2007-01-03 Thread Find Me

On 1/2/07, sdeck <[EMAIL PROTECTED]> wrote:



Thanks in advance for any insight on this one.

I have a fairly large query to run, and it takes roughly 20-40 seconds to
complete the way that I have it.
Here is the best example I can give.

I have a set of roughly 25K documents indexed.

I have queries that get documents matching a particular actor.

Then, I have a movie query that takes all of the documents found for each
actor query and combines them all together to say, here are all documents
that are relevant for this movie.

Then, and here is the time hog, I have a genre query that says, take all
movies and get their results and combine them together into this genre
result set.



Is there any possibility of using Carrot clustering for genre? Could you
please give examples of the final complex query as well as the individual
simple queries? Could you also state the aim of the query? Are you trying to
get a clustered list of movies (based on genre) for a particular actor?

--Rajesh Munavalli

The problem is, at indexing time, I do not have a way to say if a document
is a particular genre, or a particular actor, or movie etc.  If I try, for
the genre query, to get all documents and then filter by the queries for
movies and actors, I get heap space memory issues.

The query for collecting a specific actor takes around 200-300 milliseconds,
and the movie one, which actually queries each actor, takes roughly 500-700
milliseconds. Yet for a genre, where you may have 50-100 movies, it takes
roughly 500 milliseconds times the number of movies.

Any ideas on how I could run these queries differently? For a given actor
query, there are about 5-7 boolean query clauses, just to give some
insight.

I currently just create 1 HitSetCollector (I rolled my own bitset collector)
and just run searches with it.  I just get crapped on when it does that
genre search. I wish there was an easier way to aggregate all of those
documents together from all of those searches.  After it is done, I cache
the results, but the initial hit is bad.
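
One way to picture the aggregation described above, offered as a sketch under assumptions rather than a tested fix: if each actor query's matches are held as a java.util.BitSet of document ids (as a bitset collector might produce), then a movie's set is the union of its actors' sets and a genre's set is the union of its movies' sets. Caching the per-actor sets means each actor query runs once even when the actor appears in several movies. GroupedHits and the actorSearch function are hypothetical stand-ins for the real search code.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class GroupedHits {
    // Stand-in for the real actor query: actor name -> matching document ids.
    private final Function<String, BitSet> actorSearch;
    private final Map<String, BitSet> actorCache = new HashMap<>();
    private int searchesRun = 0;

    public GroupedHits(Function<String, BitSet> actorSearch) {
        this.actorSearch = actorSearch;
    }

    /** Runs the actor query at most once per actor and remembers the result. */
    private BitSet actorDocs(String actor) {
        return actorCache.computeIfAbsent(actor, a -> {
            searchesRun++;
            return actorSearch.apply(a);
        });
    }

    /** Union of the document sets of every actor in the movie. */
    public BitSet movieDocs(List<String> actors) {
        BitSet docs = new BitSet();
        for (String actor : actors) {
            docs.or(actorDocs(actor));
        }
        return docs;
    }

    /** Union of the document sets of every movie (given as actor lists) in the genre. */
    public BitSet genreDocs(List<List<String>> moviesAsActorLists) {
        BitSet docs = new BitSet();
        for (List<String> movie : moviesAsActorLists) {
            docs.or(movieDocs(movie));
        }
        return docs;
    }

    /** Number of actor queries actually executed so far. */
    public int searchesRun() {
        return searchesRun;
    }
}
```

Whether this helps depends on how much actors overlap between the movies of a genre; with no overlap, the number of searches stays the same and only the union step is cheap.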

Any help would be much appreciated.
Sdeck



--
View this message in context:
http://www.nabble.com/Speed-of-grouped-queries-tf2910499.html#a8132099
Sent from the Lucene - Java Users mailing list archive at Nabble.com.






DisjunctionMaxQuery explaination

2006-09-19 Thread Find Me

I was trying to print out the score explanation by a DisjunctionMaxQuery.
Though there is a hit score > 0 for the results, there is no detailed
explanation. Am I doing something wrong?

In the following output, each hit has two lines. The first line is the hit
score and the second line is the explanation given by the
DisjunctionMaxQuery.

Hit 1: 0.6027994
0.0 = max plus 0.1 times others of:

Hit 2: 0.59990174
0.0 = max plus 0.1 times others of:

Hit 3: 0.41993123
0.0 = max plus 0.1 times others of:


Re: DisjunctionMaxQuery explaination

2006-09-19 Thread Find Me

public void explainSearchScore(String indexLocation, DisjunctionMaxQuery disjunctQuery) {
    IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexLocation));

    Hits hits = searcher.search(disjunctQuery);
    if (hits == null) return;

    for (int i = 0; i < hits.length(); i++) {
        System.out.println("Hit " + i + " " + searcher.explain(disjunctQuery, i).toString());
    }
}


On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: In the following output, each hit has two lines. The first line is the hit
: score and the second line is the explanation given by the
: DisjunctionMaxQuery.

how are you printing the Explanation? .. are you using the toString()?

can you post a small self contained code example showing how you got this
output?

: Hit 1: 0.6027994
: 0.0 = max plus 0.1 times others of:
:
: Hit 2: 0.59990174
: 0.0 = max plus 0.1 times others of:
:
: Hit 3: 0.41993123
: 0.0 = max plus 0.1 times others of:




-Hoss







--
--Rajesh Munavalli
Blog: http://munavalli.blogspot.com


Re: DisjunctionMaxQuery explaination

2006-09-19 Thread Find Me

Forgot to add the hits.score() to print out the hits score.

public void explainSearchScore(String indexLocation, DisjunctionMaxQuery disjunctQuery) {
    IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexLocation));

    Hits hits = searcher.search(disjunctQuery);

    if (hits == null) return;

    for (int i = 0; i < hits.length(); i++) {
        System.out.println("Hit " + i + ": " + hits.score(i) +
            "\n" + searcher.explain(disjunctQuery, i).toString());
    }
}
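
A possible cause, offered as a hypothesis rather than a confirmed diagnosis: Searcher.explain(Query, int) expects an index document id, while the loop above passes the hit's rank i. In the Hits API the document id of the i-th hit is hits.id(i). Explaining a document that does not match the query would give 0.0 with no details, even though hits.score(i) is positive. The Lucene-free sketch below only models that mix-up; ToyIndex and its search and explain methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyIndex {
    private final String[] docs;

    public ToyIndex(String... docs) {
        this.docs = docs;
    }

    /** Returns the ids of the matching documents, in hit order. */
    public List<Integer> search(String term) {
        List<Integer> hitIds = new ArrayList<>();
        for (int docId = 0; docId < docs.length; docId++) {
            if (docs[docId].contains(term)) {
                hitIds.add(docId);
            }
        }
        return hitIds;
    }

    /** Explains one document, addressed by document id (not by hit rank). */
    public String explain(String term, int docId) {
        return docs[docId].contains(term)
                ? "1.0 = match on '" + term + "'"
                : "0.0 = no match";
    }
}
```

With the documents "apple pie", "cherry tart", "banana split" and "banana bread", a search for "banana" returns the ids 2 and 3. Calling explain with the ranks 0 and 1 explains the two non-matching documents, while calling it with the returned ids explains the actual hits. In the method above, the equivalent change would be searcher.explain(disjunctQuery, hits.id(i)).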

Find Me wrote:
public void explainSearchScore(String indexLocation, DisjunctionMaxQuery disjunctQuery) {
    IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexLocation));

    Hits hits = searcher.search(disjunctQuery);

    if (hits == null) return;

    for (int i = 0; i < hits.length(); i++) {
        System.out.println("Hit " + i + " " + searcher.explain(disjunctQuery, i).toString());
    }
}


On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: In the following output, each hit has two lines. The first line is the hit
: score and the second line is the explanation given by the
: DisjunctionMaxQuery.

how are you printing the Explanation? .. are you using the toString()?

can you post a small self contained code example showing how you got this
output?

: Hit 1: 0.6027994
: 0.0 = max plus 0.1 times others of:
:
: Hit 2: 0.59990174
: 0.0 = max plus 0.1 times others of:
:
: Hit 3: 0.41993123
: 0.0 = max plus 0.1 times others of:




-Hoss





Re: BooleanQuery

2006-09-29 Thread Find Me

For:
BooleanQuery bQuery = new BooleanQuery();
bQuery.add(messageQuery, true, false);

Use:
BooleanQuery bQuery=new BooleanQuery();
bQuery.add(messageQuery, BooleanClause.Occur.MUST);

Mapping is as follows:

For add(query, true, false) use add(query, BooleanClause.Occur.MUST)
For add(query, false, false) use add(query, BooleanClause.Occur.SHOULD)
For add(query, false, true) use add(query, BooleanClause.Occur.MUST_NOT)
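
The mapping above can be written down as a small helper, sketched here in plain Java without a Lucene dependency. The nested Occur enum is a local stand-in that only mirrors the three BooleanClause.Occur names, and OccurMapping and toOccur are hypothetical names for this illustration. Rejecting the combination of required and prohibited is an assumption of this sketch, since that pair has no entry in the mapping.

```java
public class OccurMapping {
    /** Local stand-in mirroring the three BooleanClause.Occur names used above. */
    public enum Occur { MUST, SHOULD, MUST_NOT }

    /** Translates the old (required, prohibited) flag pair into an Occur value. */
    public static Occur toOccur(boolean required, boolean prohibited) {
        if (required && prohibited) {
            // Assumption: a clause cannot be both required and prohibited.
            throw new IllegalArgumentException("clause cannot be both required and prohibited");
        }
        if (required) {
            return Occur.MUST;
        }
        if (prohibited) {
            return Occur.MUST_NOT;
        }
        return Occur.SHOULD;
    }
}
```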

--Rajesh Munavalli


On 9/29/06, Ismail Siddiqui <[EMAIL PROTECTED]> wrote:


Hi,

I have two phrase queries


messageQuery = new PhraseQuery();
titleQuery = new PhraseQuery();
messageQuery.setSlop(3);
titleQuery.setSlop(1);

for (int i=0; i