If you have found the terms to remove with e.g. HighFreqTerms, you can use the abstract class FilterIndexReader (FilterAtomicReader in Lucene 4.0) to write a filter for the term dictionary (just return a filtered TermsEnum) that is applied on merging. Wrap an IndexReader with this FilterIndexReader so it hides the unwanted terms, then call IndexWriter.addIndexes(filteredReader) on a new, empty index. This still takes time, but it may be faster than reindexing.
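A minimal sketch of what that wrapper could look like, assuming the Lucene 4.x APIs (FilterAtomicReader with its FilterFields/FilterTerms inner classes, and FilteredTermsEnum). The class name StopTermFilterReader and the stopTerms set are my own placeholders, not part of Lucene:

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical wrapper: hides a given set of terms from the term dictionary,
// so they are dropped when the reader is merged into a new index.
public class StopTermFilterReader extends FilterAtomicReader {

  private final Set<BytesRef> stopTerms;

  public StopTermFilterReader(AtomicReader in, Set<BytesRef> stopTerms) {
    super(in);
    this.stopTerms = stopTerms;
  }

  @Override
  public Fields fields() throws IOException {
    final Fields fields = super.fields();
    if (fields == null) {
      return null;
    }
    return new FilterFields(fields) {
      @Override
      public Terms terms(String field) throws IOException {
        final Terms terms = super.terms(field);
        if (terms == null) {
          return null;
        }
        return new FilterTerms(terms) {
          @Override
          public TermsEnum iterator(TermsEnum reuse) throws IOException {
            // Enumerate the wrapped TermsEnum, skipping stop terms.
            return new FilteredTermsEnum(super.iterator(reuse), false) {
              @Override
              protected AcceptStatus accept(BytesRef term) {
                return stopTerms.contains(term) ? AcceptStatus.NO
                                                : AcceptStatus.YES;
              }
            };
          }
        };
      }
    };
  }
}
```

You would then wrap each atomic reader of the source index and feed the wrapped readers to IndexWriter.addIndexes(...) on a fresh directory. Note this sketch only filters the term dictionary; statistics such as sumTotalTermFreq may no longer match exactly after the copy.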
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Shaya Potter [mailto:spot...@gmail.com]
> Sent: Wednesday, August 15, 2012 8:43 PM
> To: java-user@lucene.apache.org
> Cc: Erick Erickson
> Subject: Re: easy way to figure out most common tokens?
>
> On 08/15/2012 02:29 PM, Erick Erickson wrote:
> > I don't see how you could without indexing everything first, since you
> > can't know what the most frequent terms are until you've processed all
> > your documents....
>
> exactly
>
> > If you know these terms in advance, it seems like you could just call
> > them stopwords and use the common stopword processing.
> >
> > If you have to examine your corpus in the first place, it seems like
> > you could do something with term frequencies to extract the most
> > common terms from your index, then re-index all your data with those
> > terms as stopwords.
>
> It's a possibility, but that would require reindexing, which would take a
> long time, hence my desire to try and edit the individual documents.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org