We have a large set of documents that we would like to index with a customized
stopword list. We have run tests by indexing a random sample of about 10% of
the documents. We would like to generate a list of the terms in that sample
together with their IDF values, and then select the terms with the lowest IDF
values as a starter set of stopwords for the full document set.
First of all, is this the best way to create a stopword list? Second, is there
a straightforward way to generate a list of terms and their IDF values from a
Lucene index?
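
To make the idea concrete, here is a minimal sketch of the selection step we
have in mind. It is plain Python and does not touch Lucene at all. The names
idf_table, stopword_candidates and sample are made up for illustration, and it
assumes the textbook formula idf = ln(N / df), which may differ from whatever
variant Lucene's similarity uses internally.

```python
import math
from collections import Counter


def idf_table(docs):
    """Return {term: idf} for a collection of token lists.

    Assumes the textbook definition idf = ln(N / df); this is an
    illustrative choice, not necessarily the formula Lucene uses.
    """
    docs = list(docs)
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        # Count each term at most once per document (document frequency).
        df.update(set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def stopword_candidates(docs, k):
    """Return the k terms with the lowest IDF, ties broken alphabetically."""
    idf = idf_table(docs)
    return sorted(idf, key=lambda term: (idf[term], term))[:k]


# Hypothetical toy sample standing in for the 10% subset.
sample = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "fox", "and", "the", "dog"],
]

print(stopword_candidates(sample, 3))
```

On this toy sample the term "the" appears in every document, so its IDF is 0
and it comes out first. What we are missing is the equivalent of the df
counts taken directly from the Lucene index.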
Thanks,
Mike