Re: Document Similarities lucene(particularly using doc id's)

Lokeya Mon, 20 Aug 2007 08:42:58 -0700

Thanks for your reply. I will answer your questions below one by one:

What is the bigger goal you are trying to get at?

I am trying to implement a "Search System" where we have basically 2 input
queries and output should be chain of documents which connects the 2 input
queries. These chain of documents should be ranked according to
relevance.This has following steps:

1.Create 2 end point document sets (for 2 user queries)
2. Expand queries (using centroid terms)
3. Run that expanded query and get an intermediate document set(which will
have documents that would connect end points)
4. Having 2 end points and an intermediate set, try to find the
connections(doc-doc similarities) and get the pathways(chain of documents
that connect 1st and 2nd query). So a chain can have maximum of 4 and
minimum of 1( doc same in all 3 sets which is the highly relevant for 2
input queries)
5. We have in mind specific applications which could use this set up and
right now doing a prototype system using citeseer documents as sample data
set which is around a million documents.

I am done with the steps 1, 2 and 3. And trying to get the 4th module done.

What do you need  the similarity score for? 

This will be used to rank documents and make pathways (which is the ultimate
goal)

Do you need to compare every item in set 1 against every item in set 2?

Yes I should do this . Like i said i have 3 sets :Left set for user query 1,
Right set for user query 2, and intermediate set for expanded query. First i
have to do n X n comparison for all unique docs in intermediate set and
store compared doc ids , score info in a data structure. Then take set 1
docs and compare it with intermediate set and look for docs having scores
more than a  fixed threshold, do this for Intermediate set and right side
set also. At the end we can come up with the relevant chain of documents.

In your initial reply there was a qn :

What do the doc ids have to do with the content? , the reason why i asked
this question was i was wondering if there is some implementation which
already exists(to do a doc-doc comparision based on doc-id's which would be
a cool one for users.)

Grant Ingersoll-6 wrote:
> 
> You would have write it, as it doesn't exist in Lucene (but could be  
> a useful contribution).  The easiest version is probably the cosine  
> similarity, described at http://en.wikipedia.org/wiki/Vector_space_model
> 
> Essentially, you have two vectors, and you need to calculate the  
> cosine of the angle between them.  Then you can do this for each one  
> in your set.
> 
> What is the bigger goal you are trying to get at?  What do you need  
> the similarity score for?  Do you need to compare every item in set 1  
> against every item in set 2?
> 
> On Aug 19, 2007, at 11:19 PM, Lokeya wrote:
> 
>>
>> Hi,
>>
>> Thanks for your reply.
>>
>> I can use the getTermFreqVector() on Index Reader and get it. But I am
>> wondering whats the API which has to be used to find the similarity  
>> between
>> 2 such vectors which would give a score (doc-doc similairty in   
>> essence).
>>
>> Thanks.
>>
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> Hi,
>>>
>>>
>>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
>>>
>>>>
>>>> Hi All,
>>>>
>>>> I have the following set up: a) Indexed set of docs. b) Ran 1st
>>>> query and
>>>> got tops docs  c) Fetched the id's from that and stored in a data
>>>> structure.
>>>> d) Ran 2nd query , got top docs , fetched id's and stored in a data
>>>> structure.
>>>>
>>>> Now i have 2 sets of doc ids (set 1) and (set 1).
>>>>
>>>> I want to find out the document content similarity between these 2
>>>> sets(just
>>>> using doc ids information which i have).
>>>>
>>>
>>> Not sure what you mean here.  What do the doc ids have to do with the
>>> content?
>>>
>>>> Qn 1: Is it possible using any lucene api's. In that case can you
>>>> point me
>>>> to the appropriate API's. I did a search at
>>>> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/
>>>> javadoc/index.html
>>>> But couldn't find anything.
>>>>
>>>
>>> It is possible if you use Term Vectors (see
>>> IndexReader.getTermFreqVector).  You will need to store (when you
>>> construct your Field) and load the term vectors and then calculate
>>> the similarity.  A common way of doing this is by calculating the
>>> cosine of the angle between the two vectors.
>>>
>>> -Grant
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>>
>> -- 
>> View this message in context: http://www.nabble.com/Document- 
>> Similarities-lucene%28particularly-using-doc-id%27s%29- 
>> tf4281286.html#a12229492
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
> 
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Document-Similarities-lucene%28particularly-using-doc-id%27s%29-tf4281286.html#a12238336
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Document Similarities lucene(particularly using doc id's)

Reply via email to