If you have found the terms to remove with e.g. HighFreqTerms, you can use the abstract class FilterIndexReader (FilterAtomicReader in Lucene 4.0) to write a filter for the term dictionary (just return a filtered TermsEnum) that is applied on merging. Wrap an IndexReader with this FilterIndexReader so it hides the unwanted terms, then call IndexWriter.addIndexes(filteredReader) on a new, empty index. This still takes time, but it may be faster than reindexing.
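A minimal sketch of what that wrapper could look like, assuming the Lucene 4.x APIs (FilterAtomicReader with its FilterFields/FilterTerms inner classes, and FilteredTermsEnum). The class name StopTermFilterReader and the stopTerms set are my own placeholders, not part of Lucene:

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical wrapper: hides a given set of terms from the term dictionary,
// so they are dropped when the reader is merged into a new index.
public class StopTermFilterReader extends FilterAtomicReader {

  private final Set<BytesRef> stopTerms;

  public StopTermFilterReader(AtomicReader in, Set<BytesRef> stopTerms) {
    super(in);
    this.stopTerms = stopTerms;
  }

  @Override
  public Fields fields() throws IOException {
    final Fields fields = super.fields();
    if (fields == null) {
      return null;
    }
    return new FilterFields(fields) {
      @Override
      public Terms terms(String field) throws IOException {
        final Terms terms = super.terms(field);
        if (terms == null) {
          return null;
        }
        return new FilterTerms(terms) {
          @Override
          public TermsEnum iterator(TermsEnum reuse) throws IOException {
            // Enumerate the wrapped TermsEnum, skipping stop terms.
            return new FilteredTermsEnum(super.iterator(reuse), false) {
              @Override
              protected AcceptStatus accept(BytesRef term) {
                return stopTerms.contains(term) ? AcceptStatus.NO
                                                : AcceptStatus.YES;
              }
            };
          }
        };
      }
    };
  }
}
```

You would then wrap each atomic reader of the source index and feed the wrapped readers to IndexWriter.addIndexes(...) on a fresh directory. Note this sketch only filters the term dictionary; statistics such as sumTotalTermFreq may no longer match exactly after the copy.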
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Shaya Potter [mailto:spot...@gmail.com]
> Sent: Wednesday, August 15, 2012 8:43 PM
> To: java-user@lucene.apache.org
> Cc: Erick Erickson
> Subject: Re: easy way to figure out most common tokens?
>
> On 08/15/2012 02:29 PM, Erick Erickson wrote:
> > I don't see how you could without indexing everything first, since you
> > can't know what the most frequent terms are until you've processed all
> > your documents....
>
> exactly
>
> > If you know these terms in advance, it seems like you could just call
> > them stopwords and use the common stopword processing.
> >
> > If you have to examine your corpus in the first place, it seems like
> > you could do something with term frequencies to extract the most
> > common terms from your index, then re-index all your data with those
> > terms as stopwords.
>
> It's a possibility, but that would require reindexing, which would take a
> long time, hence my desire to try and edit the individual documents.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org