I have a set of documents split into doc_sections (d), each of which is further split into n sentences. I am using an ontology and want to calculate the similarity between the definitions of the ontology terms and the doc_sections. The documents are indexed at sentence level, so each sentence is a document in the Lucene index. Each ontology term definition is indexed as a separate document in the same Lucene index.
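To make the setup concrete, here is a minimal sketch in plain Python (not Lucene API code) of the kind of computation I have in mind: sentence-level term counts are summed into one vector per doc_section, and a single IDF value per term is computed over the whole collection of sections and ontology definitions. The toy sentences, the ids, and the helper names group_tf, collection_idf and cosine are assumptions for illustration only, and the IDF variant log(N/df) + 1 is just one common choice, not necessarily what Lucene's similarity classes use.

```python
import math
from collections import Counter

# Hypothetical sentence-level index: one (unit_id, tokens) entry per sentence.
# "sec1"/"sec2" stand for doc_sections, "onto1" for an ontology definition.
sentences = [
    ("sec1", ["gene", "regulates", "protein"]),
    ("sec1", ["protein", "binds", "dna"]),
    ("sec2", ["protein", "folding", "pathway"]),
    ("onto1", ["gene", "expression", "pathway"]),
]

def group_tf(sentences):
    """Sum sentence-level term counts into one Counter per unit id."""
    units = {}
    for unit_id, tokens in sentences:
        units.setdefault(unit_id, Counter()).update(tokens)
    return units

def collection_idf(units):
    """One IDF value per term, computed over units rather than sentences."""
    n = len(units)
    # Iterating a Counter yields each distinct term once, so this counts units.
    df = Counter(term for counts in units.values() for term in counts)
    return {term: math.log(n / d) + 1.0 for term, d in df.items()}

def cosine(a, b, idf):
    """Cosine similarity between two term-count vectors weighted by idf."""
    wa = {t: c * idf.get(t, 0.0) for t, c in a.items()}
    wb = {t: c * idf.get(t, 0.0) for t, c in b.items()}
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

units = group_tf(sentences)
idf = collection_idf(units)

# Similarity between two sections, and between a section and a definition.
sim_sections = cosine(units["sec1"], units["sec2"], idf)
sim_ontology = cosine(units["sec2"], units["onto1"], idf)

# An ad hoc document built from selected sentences (here the 2nd and 3rd),
# scored against existing units without rebuilding anything.
new_doc = Counter()
for i in (1, 2):
    new_doc.update(sentences[i][1])
sim_new = cosine(new_doc, units["sec1"], idf)
```

In this sketch the IDF table is built once and reused for every comparison, which is the behaviour I would like to get out of the existing sentence-level index.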
These are my use cases:

1) I want to calculate similarity (Okapi, cosine) between doc_section(i) and doc_section(j), and between doc_section(j) and the ontology definitions. Since each sentence is itself a document in the Lucene index, I need to calculate TF and IDF for a collection of sentences. TF is specific to each document and does not change with the collection, but how do I calculate IDF for a collection (one IDF value per term for the whole collection, not one per document)? This is the reason why I indexed at sentence level.

2) I would select some specific sentences and treat them as a newly_created_document. Then I want to calculate similarity between the newly_created_document and the doc_sections, and between the newly_created_document and the ontology definitions. Here too each sentence is a document in the Lucene index, so I want to calculate TF-IDF for a collection (one IDF value for the collection, not one per document). How can this be done?

This could be done easily by re-creating the index with custom documents, but I don't want to re-create the index again and again for every newly created document; that would be too computationally intensive.

--
Regards
Kasun Perera