Hi Simon,

Thank you for your reply. The document length is just one example of what I need to store. Another statistic I need is a *normalised* sum of the TFs. I can compute this at retrieval time by extending SimilarityBase and storing the values in my own cache, which is consulted whenever the score method is invoked. However, I am trying to push this into the index to make it more efficient, and, as I said earlier, I haven't found a way to do this yet.
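
To make the caching idea concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not touch SimilarityBase: the class RawLengthCache and its methods are hypothetical names made up for this example, and it assumes each document has already been tokenised into a list of terms. It only shows the raw document length being computed as the sum of the term frequencies and kept in a per-document cache.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: caches the raw document length,
// defined here as the sum of all term frequencies (TFs) in a document.
final class RawLengthCache {
    // docId -> raw document length
    private final Map<Integer, Long> lengthByDoc = new HashMap<>();

    // Counts how often each term occurs in an already-tokenised document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Raw document length = sum of the term frequencies.
    static long sumOfTermFrequencies(Map<String, Integer> tf) {
        long sum = 0;
        for (int freq : tf.values()) {
            sum += freq;
        }
        return sum;
    }

    // Computes the raw length of one document and stores it under its id.
    void put(int docId, List<String> tokens) {
        lengthByDoc.put(docId, sumOfTermFrequencies(termFrequencies(tokens)));
    }

    // Returns the cached raw length, or -1 if the document was never added.
    long rawLength(int docId) {
        return lengthByDoc.getOrDefault(docId, -1L);
    }
}
```

A scoring routine could then look up rawLength(docId) instead of the encoded/decoded length. How such a lookup would be wired into a real similarity is left open here.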
With regard to document length (DL), yes, you are right, but unfortunately Lucene doesn't expose the raw (real) document length, as far as I know. It only provides the encoded/decoded DL. I have read on the forum, and seen in my own experiments, that the difference in quality between a similarity function implemented with the raw DL and the same function implemented with Lucene's exposed (encoded/decoded) DL is not statistically significant. However, I still prefer to use the raw DL, which is why I compute the sum of the TFs in a document and cache it.

h.

On 4 Jan 2012, at 14:37, Simon Willnauer wrote:

> Hey,
>
> On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam <h...@eecs.qmul.ac.uk> wrote:
>> Hi,
>>
>> I am experimenting with the Lucene trunk (aka 4.0), especially with the new
>> IndexDocValues feature. I am trying to store some query-independent
>> statistics such as PageRank, etc. One stat that I am trying to store is the
>> sum of all the term frequencies in a document. This can be seen as the
>> document length. Is there a way to pre-compute this sum while performing the
>> indexing?
>
> Lucene is already computing the length of the document in its
> FieldInvertedState which is passed to similarity ie. look at
> Similarity#computeNorms. Currently the norm value is a single byte
> but I am working on exposing this via DocValues so you can store
> custom data in your similarity.
>
> simon
>>
>> Thank you,
>> h.
>>
>>
>>> TermVectors are still available in Lucene trunk aka 4.0, we just changed
>>> the implementation of them to fit the general Lucene Terms/Fields/… APIs.
>>> TermVectors (if enabled in the document during indexing) are simply
>>> something like a small index per document written to disk like a stored
>>> field (it has nothing to do with DocValues, because you mentioned this).
>>> Theoretically, you can execute a query against the small “TermVectors
>>> Index” and get exactly one hit or no hit, if the query matches this
>>> document. This is e.g. used for highlighting if TV are enabled. To support
>>> this “TV as a small index”, the old API was removed and the new TermVectors
>>> API returns the same Terms/TermsEnum/DocsEnum APIs like IndexReader for a
>>> complete index, but all structures simply return one document (ID=0) and
>>> corresponding term frequencies/doc frequencies.
>>>
>>> To have some example code how to use it, review the Lucene testcases, some
>>> example:
>>>
>>> Terms result =
>>>     reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>>> assertNotNull(result);
>>> assertEquals(3, result.getUniqueTermCount());
>>> TermsEnum termsEnum = result.iterator(null);
>>> while(termsEnum.next() != null) {
>>>     String term = termsEnum.term().utf8ToString();
>>>     int freq = (int) termsEnum.totalTermFreq();
>>>     assertTrue(freq > 0);
>>> }
>>>
>>> Fields results = reader.getTermVectors(docId);
>>> assertTrue(results != null);
>>> assertEquals("We do not have 3 term freq vectors", 3,
>>>     results.getUniqueFieldCount());
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org