easy way to figure out most common tokens?

Shaya Potter Wed, 15 Aug 2012 10:47:09 -0700

Is there an easy way to figure out the most common tokens and then remove those tokens from the documents?

Use case: imagine one is indexing a mailing list (such as this one, java-user), extracting all e-mail addresses in the messages, and adding them to a doc.


What that means is that there will be a lot of

java-user-unsubscr...@lucene.apache.org
java-user-h...@lucene.apache.org

due to that being in the signature of each email.

While the best approach might be to not put them in the index in the first place, I'm wondering if there's a good way to process the index after the fact to remove these types of entries.
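
To make the "don't put it in the index in the first place" variant concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class and method names are made up for illustration, the 0.9 cutoff is an arbitrary assumption, and the addresses are placeholders. It counts in how many docs each extracted address appears and drops the ones that appear in nearly all of them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class CommonTokenFilter {

    // Count in how many documents each token appears (its document frequency).
    static Map<String, Integer> documentFrequencies(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String token : doc) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // Tokens that appear in more than maxDocFraction of all documents.
    static Set<String> commonTokens(List<Set<String>> docs, double maxDocFraction) {
        Set<String> common = new HashSet<>();
        double limit = maxDocFraction * docs.size();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            if (e.getValue() > limit) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    // Copy of one document's tokens with the common tokens removed.
    static Set<String> withoutCommon(Set<String> doc, Set<String> common) {
        Set<String> kept = new HashSet<>(doc);
        kept.removeAll(common);
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder addresses; the list address stands in for a signature entry.
        List<Set<String>> docs = List.of(
                Set.of("alice@example.org", "list-help@example.org"),
                Set.of("bob@example.org", "list-help@example.org"),
                Set.of("list-help@example.org"));
        Set<String> common = commonTokens(docs, 0.9); // assumed cutoff: > 90% of docs
        for (Set<String> doc : docs) {
            System.out.println(withoutCommon(doc, common));
        }
    }
}
```

The cutoff is relative to the number of docs rather than an absolute count, so the same setting works for a small or a large archive; whether 0.9 is a sensible value depends on the data.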


thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
