Hi Marie,

On 09/11/2008 at 4:03 AM, Marie-Christine Plogmann wrote:
> I am currently using the demo class IndexFiles to index some
> corpus. I have replaced the Standard by a GermanAnalyzer.
> Here, indexing works fine.
> But if i specify a different stopword list that should be
> used, the tokenization doesn't seem to work properly. Mostly
> some letters are missing at the end. Has somebody encountered
> a similar problem? What could be the problem?
Are you sure that this only occurs after you change the stopword list?

I assume you're using the GermanAnalyzer in contrib/. It constructs an analysis pipeline consisting of StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and finally GermanStemFilter. GermanStemFilter invokes GermanStemmer:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_3_2/contrib/analyzers/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?view=markup>

which is an implementation of the stemming algorithm described in the paper linked from here:

<http://www.inf.fu-berlin.de/inst/pubs/tr-b-99-16.abstract.html>

A basic question to get out of the way: Are you aware that the stemming operation removes letters from the end of some words? If what you're seeing are stemmed terms, then missing letters at the ends of words are expected behavior rather than a tokenization problem. The P.S. below shows a quick way to print exactly what the analyzer emits so you can check.

Steve
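P.S. Here is a minimal sketch of what I mean by "print what the analyzer emits." It assumes you're on the 2.3.x line (the pre-2.9 TokenStream API with next()/termText()), matching the 2.3.2 tag linked above; the class name, field name, sample sentence, and three-word stopword list are only placeholders, so substitute the list you actually pass to IndexFiles.

  import java.io.StringReader;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.de.GermanAnalyzer;

  public class ShowGermanTokens {
      public static void main(String[] args) throws Exception {
          // Placeholder stopword list; use the one you hand to IndexFiles.
          String[] stopwords = { "und", "oder", "aber" };
          GermanAnalyzer analyzer = new GermanAnalyzer(stopwords);

          // Run a sample sentence through the full analysis pipeline
          // (StandardTokenizer + filters + GermanStemFilter) and
          // print each term that would reach the index.
          String text = "Die Kinder spielten in den Häusern";
          TokenStream stream =
              analyzer.tokenStream("contents", new StringReader(text));
          for (Token token = stream.next(); token != null; token = stream.next()) {
              System.out.println(token.termText());
          }
      }
  }

If the printed terms differ from the input only by their stripped endings, that's the stemmer doing its job; if terms go missing or come out mangled in other ways, compare the output with and without your custom stopword list to narrow down where things change.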