RE: What is the proper use of stop words in Lucene?

Uwe Schindler Mon, 28 Apr 2014 12:47:31 -0700

> Hello Uwe,
> 
> Thank you for the reply. I see that there is a version check for the 
> use of setEnablePositionIncrements(false); and, I think I may be able 
> to use an earlier api with the eXist-db embedding of Lucene 4.4 to 
> avoid the version check.


Hi,

You don't need an older version of the Lucene library. It is enough to pass the 
older version constant, which also works with Lucene 4.7 or 4.8 (to be released 
shortly):

  sf = new StopFilter(Version.LUCENE_43, ...);
  sf.setEnablePositionIncrements(false);

The version constant exists exactly for this purpose: it lets you keep using 
components whose behavior changed incompatibly in later versions, preserving 
index and behavior compatibility.

About stop words: what you are doing is not really a "stop words" use case. The 
main reasoning behind stop words is the following:
- Stop words occur in almost every document, so it makes no sense to query for 
them.
- The only relevant information carried by a stop word is "there was a word at 
this position".
If the second point were not taken care of, this information would be lost as 
well.

If every document really contains a specific stop word (which is almost always 
the case), a phrase query containing that stop word must give the same results 
against an index with all stop words indexed as against an index with stop 
words left out. This is only possible if the removed stop word still reserves a 
position.
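
To make the position argument concrete, here is a minimal plain-Java sketch. It 
does not use any Lucene classes; StopPositionDemo, analyze, phraseMatches and 
the keepGaps flag are names made up for this illustration, and the whitespace 
tokenizer is a simplification. It only models the idea that a removed stop word 
either keeps its slot or does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene API.
final class StopPositionDemo {

    /** A term together with the position it occupies in the token stream. */
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) { this.term = term; this.position = position; }
        @Override public String toString() { return term + "@" + position; }
    }

    /**
     * Splits text on whitespace and drops stop words. If keepGaps is true, a
     * dropped word still consumes a position; otherwise the remaining terms
     * are numbered consecutively.
     */
    static List<Token> analyze(String text, Set<String> stopWords, boolean keepGaps) {
        List<Token> out = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (stopWords.contains(word)) {
                if (keepGaps) position++;   // reserve the slot of the removed word
                continue;
            }
            out.add(new Token(word, position++));
        }
        return out;
    }

    /** True if doc contains all query terms at the same relative offsets. */
    static boolean phraseMatches(List<Token> doc, List<Token> query) {
        if (query.isEmpty()) return false;
        Token first = query.get(0);
        for (Token start : doc) {
            if (!start.term.equals(first.term)) continue;
            int shift = start.position - first.position;
            boolean all = true;
            for (Token q : query) {
                boolean found = false;
                for (Token d : doc) {
                    if (d.term.equals(q.term) && d.position == q.position + shift) {
                        found = true;
                        break;
                    }
                }
                if (!found) { all = false; break; }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("of"));
        for (boolean keepGaps : new boolean[] {true, false}) {
            List<Token> query = analyze("wizard of oz", stop, keepGaps);
            System.out.println("keepGaps=" + keepGaps + " query=" + query
                + " matches 'wizard of oz': "
                + phraseMatches(analyze("wizard of oz", stop, keepGaps), query)
                + ", matches 'wizard oz': "
                + phraseMatches(analyze("wizard oz", stop, keepGaps), query));
        }
    }
}
```

With keepGaps=true the query still distinguishes "wizard of oz" from "wizard 
oz", just as an index with all words would. With keepGaps=false both texts 
match, which is the "ignore" behavior and not the stop word behavior.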

What you intend to do is not a "stop word" use case. You want to "ignore" some 
words. Lucene has no support for this, because in natural language processing 
it makes no sense. Possible ways to do it anyway:
a) write your own TokenFilter, violating the TokenStream contracts
b) use the backwards compatibility layer with matchVersion=LUCENE_43
c) maybe remove the words before tokenizing (e.g. with MappingCharFilter, 
mapping the "ignore words" to the empty string)
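
The idea behind option (c) can be sketched in plain Java, again without any 
Lucene classes: the text is cleaned before it reaches the tokenizer, so the 
tokenizer never sees the ignored words and no position gap can appear. 
IgnoreWordsDemo, stripIgnored and tokenize are hypothetical names for this 
sketch. It uses a regex with word boundaries as a stand-in; as far as I know, 
MappingCharFilter maps character sequences without regard to word boundaries, 
so a real mapping would need care that "of" does not also hit "often".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only; a stand-in for a char filter, not Lucene API.
final class IgnoreWordsDemo {

    /** Replaces every whole-word occurrence of an ignore word with the empty string. */
    static String stripIgnored(String text, List<String> ignoreWords) {
        if (ignoreWords.isEmpty()) return text;
        StringBuilder alternation = new StringBuilder();
        for (String w : ignoreWords) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(w));   // treat the word literally
        }
        Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll("");
    }

    /** Simplified whitespace tokenizer; list index doubles as the term position. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) terms.add(word);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> ignore = Arrays.asList("of");
        // The ignored word is gone before tokenizing, so positions are contiguous.
        System.out.println(tokenize(stripIgnored("Wizard of Oz", ignore)));
        // Word boundaries keep "often" intact.
        System.out.println(tokenize(stripIgnored("often of use", ignore)));
    }
}
```

The same cleaning has to be applied to query text as well, otherwise indexed 
and queried positions no longer line up.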

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
