Is there an easy way to figure out the most common tokens in an index
and then remove those tokens from the documents?
Use case: imagine one is indexing a mailing list (such as this
java-user list), extracting all e-mail addresses in the messages, and
adding them to a doc.
That means there will be a lot of
java-user-unsubscr...@lucene.apache.org
java-user-h...@lucene.apache.org
entries, because those addresses are in the signature of each email.
While the best approach might be to not put them in the index in the
first place, I'm wondering if there's a good way to process the index
after the fact to remove these types of entries.
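
To make the idea concrete, here is a rough sketch of what I mean, in
plain Java rather than against the Lucene API. It counts how many docs
each token appears in and drops any token found in more than a chosen
fraction of them. The class name CommonTokenFilter, its method names,
the 0.5 threshold and the example.com / example.org addresses are all
made up for illustration. It only works on in-memory token lists, so
it shows the logic and not how to change an existing index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: works on plain token lists.
class CommonTokenFilter {

    // For each token, count the number of docs that contain it at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a doc so repeats do not inflate the count.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // Tokens that occur in more than maxDocFraction of all docs.
    static Set<String> tooCommon(List<List<String>> docs, double maxDocFraction) {
        double limit = maxDocFraction * docs.size();
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            if (e.getValue() > limit) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    // Copies of the docs with every common token removed; input is not modified.
    static List<List<String>> strip(List<List<String>> docs, Set<String> common) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>(doc);
            kept.removeAll(common);
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up addresses standing in for the extracted e-mail tokens.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("alice@example.com", "list-help@example.org"),
            Arrays.asList("bob@example.com", "list-help@example.org"),
            Arrays.asList("carol@example.com", "list-help@example.org"));
        Set<String> common = tooCommon(docs, 0.5);
        System.out.println("too common: " + common);
        System.out.println("stripped:   " + strip(docs, common));
    }
}
```

On a real index I assume the per-token counts would come from the
index itself, and that each affected doc would have to be deleted and
re-added without the unwanted values, which is the part I'm unsure
how to do cheaply.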
Thanks.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org