> I'd like to use Lucene to search text documents for the existence of a
> large list of search terms. I have a file that contains thousands of
> entries, one word per line. I was thinking about writing a specialized
> analyzer that tokenizes the document by looking up each token of the
> source document in my list of words and returns terms only for words
> that exist in my file. I'm hoping that with this approach the index
> will contain only items that exist in my file.

Sounds like KeepWordFilter [1][2] is what you are looking for. keepwords.txt would be your file containing thousands of entries, one word per line. As you guessed, with this approach the index will contain only terms that appear both in your documents and in keepwords.txt; every other token is dropped at analysis time.

I can share the code to use this TokenFilter in Lucene if you want. Alternatively, you can simply copy and paste KeepWordFilter.java.

[1] http://lucene.apache.org/solr/api/org/apache/solr/analysis/KeepWordFilter.html
[2] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeepWordFilterFactory
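
In case it helps to see the idea before wiring it into an analyzer, here is a minimal plain-Java sketch of what keep-word filtering does. It is not the KeepWordFilter source and does not use the Lucene API at all; the class and method names (KeepWordsSketch, loadKeepWords, keepOnly) are made up for illustration, and the lowercasing and the split on non-letter/non-digit characters are my assumptions about how you would want to tokenize.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
class KeepWordsSketch {

    // Build the keep set from the lines of a word-list file (one word per line).
    // Blank lines are skipped; words are trimmed and lowercased (an assumption).
    static Set<String> loadKeepWords(List<String> lines) {
        Set<String> keep = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                keep.add(word);
            }
        }
        return keep;
    }

    // Tokenize the text and return only the tokens found in the keep set,
    // in document order. Everything else is discarded, which is the effect
    // a keep-word token filter has on what reaches the index.
    static List<String> keepOnly(String text, Set<String> keep) {
        List<String> kept = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (keep.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

You could feed loadKeepWords the result of java.nio.file.Files.readAllLines on keepwords.txt and then call keepOnly on each document's text. The hash-set lookup keeps the per-token cost constant, so a list of several thousand words should not be a problem.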