It depends on the order of the filters in your Analyzer. You want to be
sure you put the stop word filter before the stemming filter. The reason
the MoreLikeThis class does not do what you want is that it first
applies the Analyzer (which stems) and THEN applies its own stop word
removal. If you pass an Analyzer that removes stop words before
stemming, you don't have to worry about the stemming at all: the
stopword 'uninteresting' would be removed before the stemming even
occurred in the analyzer. The tokens from the analyzer would then be fed
to the MoreLikeThis stop word removal, but you could just leave that
list empty, as it's too late anyway. You would have already done your
stop word removal in the Analyzer rather than in MoreLikeThis, whose
removal can only occur after the Analyzer has been fully applied to the
text. Frankly, I don't know why MoreLikeThis supports its own stopword
list. You can always do it in a custom analyzer that you pass to
MoreLikeThis, which gives you more control over when the stopword
removal is applied (say, before or after stemming). Sugar, I guess.
- Mark
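To make the ordering point concrete, here is a minimal sketch in plain Java. It does not use Lucene: toyStem is a hypothetical suffix stripper standing in for the Snowball stemmer, and the class and method names are made up for illustration. It only shows what each filter order compares the stop list against. Note that with the stop filter first, only the literal form on the list is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class FilterOrderSketch {

    // Toy suffix stripper standing in for a real stemmer (assumption: NOT Snowball).
    public static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "er", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Stop filter first, stemmer second: stopwords are matched against raw tokens.
    public static List<String> stopThenStem(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(toyStem(t));
            }
        }
        return out;
    }

    // Stemmer first, stop filter second: stopwords are matched against stems.
    public static List<String> stemThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String stem = toyStem(t);
            if (!stopWords.contains(stem)) {
                out.add(stem);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("developer", "developed", "search", "engines");
        Set<String> stopWords = Set.of("developer"); // unstemmed, as a user would type it

        // "developer" is dropped, but "developed" survives as the stem "develop".
        System.out.println(stopThenStem(tokens, stopWords)); // [develop, search, engine]

        // Nothing is dropped: no stem equals the unstemmed word "developer".
        System.out.println(stemThenStop(tokens, stopWords)); // [develop, develop, search, engine]
    }
}
```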
Donna L Gresh wrote:
I wasn't sure this:
"Instead add the stopwords to the analyzer that you pass to
MoreLikeThis. That way you can ensure that the analyzer applies the
stopword list before stemming"
would work, because I don't want to provide all the variants of the
stopword list. If I do this, only the one provided will be removed,
correct?
Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]
Mark Miller <[EMAIL PROTECTED]> wrote on 10/15/2007 10:37:22 AM:
Sounds right to me.
The other option I think you have is to not use the MoreLikeThis
stopword functionality. Instead add the stopwords to the analyzer that
you pass to MoreLikeThis. That way you can ensure that the analyzer
applies the stopword list before stemming (The MoreLikeThis stopword
removal is implemented so that stopwords are removed after stemming).
Then you just have to add 'developer' to the stop list, and you can
forget about handling stemmed forms.
Your method should also work though.
- Mark
Donna L Gresh wrote:
Could those "in the know" comment on my current understanding of
stemming and stopwords using the snowball analyzer?
In my application, I am using the MoreLikeThis class to find similar
documents to an input "text blob". There are words in the input text
blob which are "uninteresting" for my application, so I create a list of
these words. These words are "uninteresting" no matter what their tense
or usage, for example, "develop", "developing", "developed", and
"developer" are all uninteresting and I do not want them included in the
search query created by the MoreLikeThis class.
My index documents are stemmed using the Snowball analyzer. I do not use
any stopwords when the documents are indexed (as I would like the choice
of stopwords to be under user control at search time).
I would like the user to be able to provide to the search application a
list of "uninteresting" words, and for obvious reasons would like to
force them to provide only, say, "developer" and have the application
understand that all variants should be ignored (and I don't want to
force them to try to guess what the stemmed version of "developer" is).
My first try was to use MoreLikeThis with the Snowball analyzer and a
simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and
MoreLikeThis.setStopWords). However, it appears that the stopwords
provided to the MoreLikeThis class are compared in an exact way to the
token stream output by the Snowball filter (where the words have been
stemmed), so "developer" will not match anything, and all variants pass
through. Even if I provide the list of unstemmed stopwords to the
snowball analyzer instead, they are used "as-is" with no stemming
performed, so "developer" will not remove "developed".
Apparently the following is necessary for my application:
Construct a snowball analyzer with no stopwords. Use the unstemmed
stopword list with the analyzer to construct a stemmed version of the
set of stopwords. Use this set of stemmed stopwords as the stopwords
input to the MoreLikeThis class (where the tokens are compared to the
stemmed versions after being output from the Snowball analyzer).
Is my understanding correct?
Donna
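The approach described above (stem the user's stop list with the same stemmer, then compare against the stemmed tokens) can be sketched roughly as follows. This is plain Java, not Lucene code: toyStem is a hypothetical stand-in for the Snowball stemmer, and stemStopWords and filterStemmed are illustrative names. In a real application the stems would presumably come from running the stop list through the stopword-free Snowball analyzer, and the resulting set would be what gets handed to MoreLikeThis.setStopWords.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StemmedStopWordsSketch {

    // Toy suffix stripper standing in for a real stemmer (assumption: NOT Snowball).
    public static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "er", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Run the user's unstemmed stop list through the same stemmer used on the text.
    public static Set<String> stemStopWords(Collection<String> userStopWords) {
        Set<String> stemmed = new HashSet<>();
        for (String word : userStopWords) {
            stemmed.add(toyStem(word.toLowerCase(Locale.ROOT)));
        }
        return stemmed;
    }

    // Stem each token, then drop it if its stem is in the stemmed stop set.
    public static List<String> filterStemmed(List<String> tokens, Set<String> stemmedStopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String stem = toyStem(t);
            if (!stemmedStopWords.contains(stem)) {
                out.add(stem);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The user supplies a single variant only.
        Set<String> stemmedStops = stemStopWords(List.of("developer"));

        List<String> tokens = List.of(
                "develop", "developing", "developed", "developer", "similar", "documents");

        // All four variants share the stem "develop" and are removed.
        System.out.println(filterStemmed(tokens, stemmedStops)); // [similar, document]
    }
}
```

The point of the design is that stop list and tokens pass through the same stemmer, so the user never has to guess the stemmed form.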
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]