Hi Marie,

On 09/11/2008 at 4:03 AM, Marie-Christine Plogmann wrote:
> I am currently using the demo class IndexFiles to index a
> corpus. I have replaced the StandardAnalyzer with a GermanAnalyzer,
> and indexing works fine.
> But if I specify a different stopword list to be used, the
> tokenization doesn't seem to work properly: some letters are
> usually missing at the ends of words. Has anybody encountered
> a similar problem? What could be causing it?

Are you sure that this only occurs after you change the stopword list?

I assume you're using the GermanAnalyzer in contrib/; it constructs an analysis 
pipeline consisting of StandardTokenizer, StandardFilter, LowerCaseFilter, 
StopFilter, and then GermanStemFilter, which invokes GermanStemmer 
<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_3_2/contrib/analyzers/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?view=markup>, 
an implementation of the stemming algorithm described in the paper linked from 
<http://www.inf.fu-berlin.de/inst/pubs/tr-b-99-16.abstract.html>.
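
If it helps with debugging, here is a minimal sketch (assuming Lucene 2.3.x with the 
contrib analyzers on the classpath, and an arbitrary sample text and stopword list) 
that rebuilds the same chain GermanAnalyzer.tokenStream() creates, so you can see 
what each stage hands to the next:

    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.de.GermanStemFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class GermanPipelineSketch {
        public static void main(String[] args) throws Exception {
            String text = "Die Häuser stehen an den Straßen";   // example text
            String[] stopwords = { "die", "an", "den" };        // hypothetical custom list

            Reader reader = new StringReader(text);
            TokenStream stream = new StandardTokenizer(reader); // split text into terms
            stream = new StandardFilter(stream);                // strip apostrophes, acronym dots
            stream = new LowerCaseFilter(stream);               // lowercase before stop filtering
            stream = new StopFilter(stream, stopwords);         // drop the stopwords
            stream = new GermanStemFilter(stream);              // stem whatever is left

            for (Token token = stream.next(); token != null; token = stream.next()) {
                System.out.println(token.termText());
            }
        }
    }

You could comment out the GermanStemFilter line to see whether the "missing letters" 
disappear along with it.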

A basic question to get out of the way: Are you aware that the stemming 
operation removes letters from the end of some words?
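
A quick way to check is to run the same text through GermanAnalyzer twice, once with 
the default stopwords and once with yours, and compare the printed tokens. (Again a 
sketch for Lucene 2.3.x; the field name "contents", the sample text, and the stopword 
list are just placeholders.)

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.de.GermanAnalyzer;

    public class CompareStopwordLists {
        static void dump(String label, Analyzer analyzer, String text) throws Exception {
            System.out.println(label + ":");
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            for (Token token = stream.next(); token != null; token = stream.next()) {
                System.out.println("  " + token.termText());
            }
        }

        public static void main(String[] args) throws Exception {
            String text = "Die Bücher liegen auf den Tischen";  // example text
            dump("default stopwords", new GermanAnalyzer(), text);
            dump("custom stopwords",
                 new GermanAnalyzer(new String[] { "die", "auf", "den" }), text);
        }
    }

If the non-stopword terms come out identically shortened in both runs, the stemmer, 
not your stopword list, is what's trimming them.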

Steve

