Hi, Erik, I understand your rant. :) Well, the solution I finally
settled on, as suggested by Jake and Grant, is this.
For those stop words, I will treat them as normal words when indexing
content. When processing the user query, there will be a normal query
part with stop words skipped, and another part that keeps the stop
words.
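A minimal sketch of that two-part query idea, in plain Python rather than Lucene (all names and the ranking scheme here are illustrative assumptions, not any real API): stop words are indexed like any other term, and the query combines a loose part with stop words stripped and an exact-phrase part that keeps them.

```python
# Toy sketch, not Lucene: stop words are indexed as normal terms,
# and the query has a loose part (stop words skipped) plus an
# exact-phrase part (stop words kept) that boosts precision.
STOP_WORDS = {"the", "a", "of", "to"}

def index_docs(docs):
    """Build an inverted index; stop words are indexed like any term."""
    inverted = {}
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            inverted.setdefault(tok, set()).add(doc_id)
    return inverted

def search(inverted, docs, query):
    """Two-part query: non-stop terms for recall, plus an exact-phrase
    check (stop words intact) that ranks precise matches first."""
    toks = query.lower().split()
    content = [t for t in toks if t not in STOP_WORDS]
    # Part 1: OR over the non-stop terms.
    loose = set()
    for t in content:
        loose |= inverted.get(t, set())
    # Part 2: docs containing the full phrase, stop words included.
    phrase = " ".join(toks)
    exact = {d for d in loose if phrase in docs[d].lower()}
    # Exact-phrase matches come first, then the rest.
    return sorted(loose, key=lambda d: (d not in exact, d))
```

For a query like "the who", the loose part matches every document mentioning "who", while the phrase part pushes documents containing "the who" to the top.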
Well, whether it's a good user experience is exactly the question. I've
spent far too much time satisfying customer (or product manager)
requests that add zero value to the product *in the user's eyes*.
And I quote:
"This was asked by a customer, who may not know what "stop words" are
at all."
This was asked by a customer, who may not know what "stop words" are at all.
Jake's approach should be quite similar to what some search engine
companies are doing. It'll cost some storage, but can achieve a good
user experience.
The benefit is kind of obvious in the real world: when users enter a
query containing stop words, they can still get exact matches.
What's your reason for trying? The whole point of stop words is that
they should be considered "no ops". That is, they add nothing to the
semantics of whatever is being processed. I don't understand the use
case for why you want to go outside that assumption.
Another way of asking this is "what are you trying to accomplish?"
I think the way I've seen it done most often is to either index some
bi-grams which contain stop words (so "the database" and "search the"
are in the index as individual tokens), or else to index that piece of
content twice - once with stop words removed (and stemming, if you use
it), and then again with the stop words left in.
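A rough sketch of the bi-gram variant described above, as a toy Python tokenizer (illustrative only; in Solr this role is played by filters along the lines of CommonGrams, not by this code): normal terms are emitted as unigrams, and any adjacent pair involving a stop word is also emitted as a single composite token.

```python
# Toy tokenizer: emit non-stop unigrams, plus bi-grams for any
# adjacent pair where at least one word is a stop word, so
# "the database" survives in the index even though "the" alone
# would be dropped.
STOP_WORDS = {"the", "a", "of", "to"}

def tokenize_with_stop_bigrams(text):
    """Return unigram tokens (stop words removed) plus stop-word bi-grams."""
    words = text.lower().split()
    tokens = [w for w in words if w not in STOP_WORDS]
    for w1, w2 in zip(words, words[1:]):
        if w1 in STOP_WORDS or w2 in STOP_WORDS:
            # Keep the pair as one indexable token.
            tokens.append(f"{w1} {w2}")
    return tokens
```

So "search the database" yields the tokens "search", "database", "search the", and "the database", at the cost of some extra index storage.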
Don't throw away the stopwords? :-) Lucene can't score something it
doesn't know exists. I suppose you could try to get fancy w/ payloads
and add payloads if stopwords exist, but I am just thinking out loud
there.
On Mar 21, 2008, at 9:20 PM, Chris Lu wrote:
Let's say "the" is considered a stop word.