Thanks for your reply. I will answer your questions below one by one: What is the bigger goal you are trying to get at?
I am trying to implement a "Search System" where we have basically 2 input queries and output should be chain of documents which connects the 2 input queries. These chain of documents should be ranked according to relevance.This has following steps: 1.Create 2 end point document sets (for 2 user queries) 2. Expand queries (using centroid terms) 3. Run that expanded query and get an intermediate document set(which will have documents that would connect end points) 4. Having 2 end points and an intermediate set, try to find the connections(doc-doc similarities) and get the pathways(chain of documents that connect 1st and 2nd query). So a chain can have maximum of 4 and minimum of 1( doc same in all 3 sets which is the highly relevant for 2 input queries) 5. We have in mind specific applications which could use this set up and right now doing a prototype system using citeseer documents as sample data set which is around a million documents. I am done with the steps 1, 2 and 3. And trying to get the 4th module done. What do you need the similarity score for? This will be used to rank documents and make pathways (which is the ultimate goal) Do you need to compare every item in set 1 against every item in set 2? Yes I should do this . Like i said i have 3 sets :Left set for user query 1, Right set for user query 2, and intermediate set for expanded query. First i have to do n X n comparison for all unique docs in intermediate set and store compared doc ids , score info in a data structure. Then take set 1 docs and compare it with intermediate set and look for docs having scores more than a fixed threshold, do this for Intermediate set and right side set also. At the end we can come up with the relevant chain of documents. In your initial reply there was a qn : What do the doc ids have to do with the content? , the reason why i asked this question was i was wondering if there is some implementation which already exists(to do a doc-doc comparision based on doc-id's which would be a cool one for users.) Grant Ingersoll-6 wrote: > > You would have write it, as it doesn't exist in Lucene (but could be > a useful contribution). The easiest version is probably the cosine > similarity, described at http://en.wikipedia.org/wiki/Vector_space_model > > Essentially, you have two vectors, and you need to calculate the > cosine of the angle between them. Then you can do this for each one > in your set. > > What is the bigger goal you are trying to get at? What do you need > the similarity score for? Do you need to compare every item in set 1 > against every item in set 2? > > On Aug 19, 2007, at 11:19 PM, Lokeya wrote: > >> >> Hi, >> >> Thanks for your reply. >> >> I can use the getTermFreqVector() on Index Reader and get it. But I am >> wondering whats the API which has to be used to find the similarity >> between >> 2 such vectors which would give a score (doc-doc similairty in >> essence). >> >> Thanks. >> >> >> >> Grant Ingersoll-6 wrote: >>> >>> Hi, >>> >>> >>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote: >>> >>>> >>>> Hi All, >>>> >>>> I have the following set up: a) Indexed set of docs. b) Ran 1st >>>> query and >>>> got tops docs c) Fetched the id's from that and stored in a data >>>> structure. >>>> d) Ran 2nd query , got top docs , fetched id's and stored in a data >>>> structure. >>>> >>>> Now i have 2 sets of doc ids (set 1) and (set 1). >>>> >>>> I want to find out the document content similarity between these 2 >>>> sets(just >>>> using doc ids information which i have). >>>> >>> >>> Not sure what you mean here. What do the doc ids have to do with the >>> content? >>> >>>> Qn 1: Is it possible using any lucene api's. In that case can you >>>> point me >>>> to the appropriate API's. I did a search at >>>> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ >>>> javadoc/index.html >>>> But couldn't find anything. >>>> >>> >>> It is possible if you use Term Vectors (see >>> IndexReader.getTermFreqVector). You will need to store (when you >>> construct your Field) and load the term vectors and then calculate >>> the similarity. A common way of doing this is by calculating the >>> cosine of the angle between the two vectors. >>> >>> -Grant >>> >>> -------------------------- >>> Grant Ingersoll >>> http://lucene.grantingersoll.com >>> >>> Lucene Helpful Hints: >>> http://wiki.apache.org/lucene-java/BasicsOfPerformance >>> http://wiki.apache.org/lucene-java/LuceneFAQ >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>> For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> >> >> -- >> View this message in context: http://www.nabble.com/Document- >> Similarities-lucene%28particularly-using-doc-id%27s%29- >> tf4281286.html#a12229492 >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> > > -------------------------- > Grant Ingersoll > http://lucene.grantingersoll.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Document-Similarities-lucene%28particularly-using-doc-id%27s%29-tf4281286.html#a12238336 Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]