Hi Simon,
I guess in a sense we are interested in obtaining a list of the top N terms, 
but they would be the top terms in the sense that they have the lowest IDF 
values. These would be the terms that appear in all or almost all documents in 
the document set. This is not a count of the number of term occurrences in 
documents; it is a count of the documents that contain at least one occurrence 
of a given term. Lucene must be storing IDF values for the terms of a document set 
somewhere in order to compute TF/IDF values when searching. I am wondering if 
there is an easy way to iterate through all of the terms that occur in the 
document set and obtain their IDF values.
Thanks,
Mike

-----Original Message-----
From: Simon Willnauer [mailto:simon.willna...@googlemail.com] 
Sent: Thursday, December 15, 2011 11:44 AM
To: java-user@lucene.apache.org
Subject: Re: Obtaining IDF values for the terms in a document set

On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <tmole...@uw.edu> wrote:
> We have a large set of documents that we would like to index with a 
> customized stopword list. We have run tests by indexing a random set of about 
> 10% of the documents, and we'd like to generate a list of the terms in that 
> smaller set and their IDF values as a way to create a starter set of 
> stopwords for the larger document set by selecting the terms that have the 
> lowest IDF values. First of all, is this the best way to create a stopword 
> list? Second, is there a straightforward way to generate a list of terms and 
> their IDF values from a Lucene index?
> Thanks,
> Mike

hey mike,

I can certainly help you with generating the list of your top N terms. whether 
that is the best or right way to generate the stopword list I am not sure, but 
maybe somebody else will step up.

to get the top N terms out of your index you can simply iterate the terms in a 
field and keep the N terms with the highest docFreq() on a heap. something like 
this:

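     // PriorityQueue here is org.apache.lucene.util.PriorityQueue (it provides
     // top() and updateTop()), not java.util.PriorityQueue.
     static class TermAndDF {
       String term;
       int df;
       TermAndDF(String term, int df) {
         this.term = term;
         this.df = df;
       }
     }
     int queueSize = N;
     // lessThan() must order by df ascending so that top() is the smallest df kept
     PriorityQueue<TermAndDF> queue = ...

     final TermEnum termEnum = reader.terms(new Term(field));
     try {
       do {
         final Term term = termEnum.term();
         // stop once the enumeration leaves the requested field
         if (term == null || !field.equals(term.field())) break;
         int docFreq = termEnum.docFreq();
         if (queue.size() < queueSize) {
           queue.add(new TermAndDF(term.text(), docFreq));
         } else if (queue.top().df < docFreq) {
           // reuse the smallest entry, then restore heap order
           TermAndDF tnFrq = queue.top();
           tnFrq.term = term.text();
           tnFrq.df = docFreq;
           queue.updateTop();
         }
       } while (termEnum.next());
     } finally {
       termEnum.close();
     }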
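
as far as I know the index keeps the docFreq per term and the IDF is derived 
from it at search time, so once you have docFreq and the number of documents 
you can compute the IDF yourself. here is a rough, self-contained sketch of the 
same heap idea in plain Java. it makes no Lucene calls: the Map of term to 
docFreq stands in for what the TermEnum loop would give you, and the names 
TopTermsSketch, topByDocFreq and idf are made up for illustration. the idf 
formula is the one I believe DefaultSimilarity uses in 3.x, 
1 + ln(numDocs / (docFreq + 1)); a custom Similarity may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch only: plain Java, no Lucene calls. All names here are hypothetical.
class TopTermsSketch {

  static final class TermAndDF {
    final String term;
    final int df;
    TermAndDF(String term, int df) {
      this.term = term;
      this.df = df;
    }
  }

  // Keeps the n entries with the largest docFreq using a bounded min-heap.
  static List<TermAndDF> topByDocFreq(Map<String, Integer> docFreqs, int n) {
    List<TermAndDF> result = new ArrayList<>();
    if (n <= 0) return result;
    PriorityQueue<TermAndDF> heap =
        new PriorityQueue<>(n, Comparator.comparingInt((TermAndDF t) -> t.df));
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      if (heap.size() < n) {
        heap.add(new TermAndDF(e.getKey(), e.getValue()));
      } else if (heap.peek().df < e.getValue()) {
        heap.poll(); // drop the smallest docFreq kept so far
        heap.add(new TermAndDF(e.getKey(), e.getValue()));
      }
    }
    result.addAll(heap);
    // Highest docFreq (that is, lowest IDF) first.
    Collections.sort(result, (a, b) -> Integer.compare(b.df, a.df));
    return result;
  }

  // Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
  static double idf(int docFreq, int numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  public static void main(String[] args) {
    // Made-up docFreq values for a 100-document sample.
    Map<String, Integer> docFreqs = new LinkedHashMap<>();
    docFreqs.put("lucene", 12);
    docFreqs.put("the", 98);
    docFreqs.put("of", 95);
    docFreqs.put("index", 30);
    int numDocs = 100;
    for (TermAndDF t : topByDocFreq(docFreqs, 2)) {
      System.out.println(t.term + " df=" + t.df + " idf=" + idf(t.df, numDocs));
    }
  }
}
```

since the IDF only goes down as docFreq goes up, the N terms with the highest 
docFreq are also the N terms with the lowest IDF, so ranking by docFreq is 
enough for picking stopword candidates.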

another way of doing it is to use index pruning and drop terms with docFreq 
above a threshold after you have indexed your doc set.
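
the threshold idea can also be used just to pick stopword candidates without 
touching the index. this is only a sketch of the selection step, not the 
pruning code itself: the class and method names and the 0.9 cutoff are 
assumptions for illustration, and the Map of term to docFreq is again a 
stand-in for values read from the index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Java, no Lucene calls. All names here are hypothetical.
class DocFreqThresholdSketch {

  // Returns the terms that occur in more than maxDocRatio of all documents.
  static List<String> stopwordCandidates(Map<String, Integer> docFreqs,
                                         int numDocs, double maxDocRatio) {
    List<String> candidates = new ArrayList<>();
    if (numDocs <= 0) return candidates;
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      if (e.getValue() / (double) numDocs > maxDocRatio) {
        candidates.add(e.getKey());
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    // Made-up docFreq values for a 100-document sample.
    Map<String, Integer> docFreqs = new LinkedHashMap<>();
    docFreqs.put("lucene", 12);
    docFreqs.put("the", 98);
    docFreqs.put("of", 95);
    docFreqs.put("index", 30);
    System.out.println(stopwordCandidates(docFreqs, 100, 0.9));
  }
}
```

a ratio is easier to carry over from the 10% sample to the full document set 
than an absolute docFreq cutoff, since it does not depend on the collection 
size.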

simon

