On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <tmole...@uw.edu> wrote:
> We have a large set of documents that we would like to index with a
> customized stopword list. We have run tests by indexing a random set of about
> 10% of the documents, and we'd like to generate a list of the terms in that
> smaller set and their IDF values as a way to create a starter set of
> stopwords for the larger document set by selecting the terms that have the
> lowest IDF values. First of all, is this the best way to create a stopword
> list? Second, is there a straightforward way to generate a list of terms and
> their IDF values from a Lucene index?
> Thanks,
> Mike

Hey Mike,

I can certainly help with generating a list of your top N terms. Whether that is the best or right way to build a stopword list I am not sure, but maybe somebody else will step up.

Terms with the lowest IDF are the terms with the highest document frequency, so to get the top N terms out of your index you can simply iterate the terms of a field and keep the N terms with the largest docFreq() on a heap. Something like this:

  static class TermAndDF {
    String term;
    int df;
    TermAndDF(String term, int df) { this.term = term; this.df = df; }
  }

  int queueSize = N;
  // min-heap ordered by df; top() and updateTop() assume
  // org.apache.lucene.util.PriorityQueue
  PriorityQueue<TermAndDF> queue = ...
  final TermEnum termEnum = reader.terms(new Term(field));
  try {
    do {
      final Term term = termEnum.term();
      // stop at the end of the enum or once we leave the field
      if (term == null || !term.field().equals(field)) break;
      int docFreq = termEnum.docFreq();
      if (queue.size() < queueSize) {
        queue.add(new TermAndDF(term.text(), docFreq));
      } else if (queue.top().df < docFreq) {
        // overwrite the least frequent entry, then restore heap order
        TermAndDF tnFrq = queue.top();
        tnFrq.term = term.text();
        tnFrq.df = docFreq;
        queue.updateTop();
      }
    } while (termEnum.next());
  } finally {
    termEnum.close();
  }

Another way of doing it is to use index pruning and drop terms with a docFreq above a threshold after you have indexed your doc set.

simon
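
For the IDF half of the question, here is a rough sketch of turning document frequencies into IDF values and keeping the N lowest. It is plain Java with no Lucene dependency: the class StopwordCandidates, its methods, and the sample counts are made up for illustration, and the formula 1 + ln(numDocs / (docFreq + 1)) is assumed to match what DefaultSimilarity uses in the 3.x line. In practice the map would be filled from the term enumeration and numDocs would come from the reader.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class StopwordCandidates {

    /** A term with its document frequency and the IDF derived from it. */
    static final class TermStat {
        final String term;
        final int docFreq;
        final double idf;

        TermStat(String term, int docFreq, double idf) {
            this.term = term;
            this.docFreq = docFreq;
            this.idf = idf;
        }
    }

    /** Assumed formula: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** Returns the n terms with the lowest IDF (highest docFreq), lowest first. */
    static List<TermStat> lowestIdf(Map<String, Integer> docFreqs, int numDocs, int n) {
        // Min-heap on docFreq: the head is the weakest of the current top n.
        PriorityQueue<TermStat> heap = new PriorityQueue<>(
                Math.max(1, n), Comparator.comparingInt((TermStat t) -> t.docFreq));
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            int df = e.getValue();
            if (heap.size() < n) {
                heap.add(new TermStat(e.getKey(), df, idf(df, numDocs)));
            } else if (n > 0 && heap.peek().docFreq < df) {
                heap.poll(); // evict the least frequent candidate
                heap.add(new TermStat(e.getKey(), df, idf(df, numDocs)));
            }
        }
        List<TermStat> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((TermStat t) -> t.idf));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies for a 1000-document sample.
        Map<String, Integer> docFreqs = new TreeMap<>();
        docFreqs.put("the", 990);
        docFreqs.put("of", 950);
        docFreqs.put("lucene", 120);
        docFreqs.put("heap", 3);
        for (TermStat t : lowestIdf(docFreqs, 1000, 2)) {
            System.out.println(t.term + "\t" + t.docFreq + "\t" + t.idf);
        }
    }
}
```

Since IDF falls as docFreq rises, ranking by docFreq and ranking by IDF pick the same terms. The IDF column is mainly useful for choosing a cutoff by eye.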