Hi Robert,

On Mar 15, 2013, at 11:29 AM, Robert Muir <rcm...@gmail.com> wrote:

> 2013/2/28 Steve Rowe <sar...@gmail.com>:
>> EnglishAnalyzer has used PorterStemmer instead of the English Snowball
>> stemmer since it was created in 2010 as part of LUCENE-2055[2]. I think
>> this is an oversight: EnglishAnalyzer should incorporate the best English
>> stemmer we've got, and Martin Porter says the Porter2 stemmer is better[1].
>> Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you
>> think?
>
> This was intentional actually. The default was a tradeoff of
> "benefits" (which affect less than 5% of english vocabulary, if you
> read around the snowball site), versus a much more significant
> performance difference as a "default".
>
> For example when i did tests of indexing both short and long texts
>
> http://find.searchhub.org/document/c1d3301b71dab5ca#46a8351089a98aec
>
> Thats overall indexing speed, not just text analysis.
>
> It might be that this guy is faster these days (we've done some
> improvements) too.

Thanks for the explanation.

I ran a lucene/benchmark alg comparing the two stemmers on trunk on my
MacBook Pro with Oracle Java 1.7.0_13, and it looks like the situation
hasn't changed much. The original-algorithm Porter stemmer is 4 times
faster than the Porter2/English Snowball stemmer (0.22s vs. 0.89s of
stemming time on top of the no-stemmer baseline), resulting in 40% higher
throughput in a full English analysis pipeline. So the default English
stemmer choice is still valid IMO.

Here's porter-comparison.alg:

-----
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
doc.tokenized=false
doc.body.tokenized=true
docs.dir=reuters-out

-AnalyzerFactory(name:original-porter-stemmer,StandardTokenizer,
                 StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
                 PorterStemFilter)

-AnalyzerFactory(name:porter2-stemmer,StandardTokenizer,
                 StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter,
                 SnowballPorterFilter(language:English))

-AnalyzerFactory(name:no-stemmer,StandardTokenizer,
                 StandardFilter,EnglishPossessiveFilter,LowerCaseFilter,StopFilter)

{ "Rounds"
    -NewAnalyzer(original-porter-stemmer)
    -ResetInputs
    { "Original Porter Stemmer" { ReadTokens > : 20000 }

    -NewAnalyzer(porter2-stemmer)
    -ResetInputs
    { "Porter2/English Stemmer" { ReadTokens > : 20000 }

    -NewAnalyzer(no-stemmer)
    -ResetInputs
    { "No Stemmer" { ReadTokens > : 20000 }

    NewRound
} : 5

RepSumByNameRound
-----

And the results (regrouped; ordered by elapsedSec) - a "rec" is a token:

-----
Operation                round  recsPerRun         rec/s  elapsedSec
No Stemmer                   2     1814029  1,234,873.38        1.47
No Stemmer                   4     1814029  1,234,873.38        1.47
No Stemmer                   1     1814029  1,230,684.50        1.47
No Stemmer                   0     1814029  1,227,353.88        1.48
No Stemmer                   3     1814029  1,226,524.00        1.48
Original Porter Stemmer      1     1814029  1,074,025.50        1.69
Original Porter Stemmer      4     1814029  1,065,196.12        1.70
Original Porter Stemmer      2     1814029  1,056,510.75        1.72
Original Porter Stemmer      3     1814029  1,030,698.31        1.76
Original Porter Stemmer      0     1814029    685,833.25        2.64
Porter2/English Stemmer      4     1814029    768,656.38        2.36
Porter2/English Stemmer      2     1814029    764,123.44        2.37
Porter2/English Stemmer      1     1814029    758,056.44        2.39
Porter2/English Stemmer      3     1814029    758,056.44        2.39
Porter2/English Stemmer      0     1814029    716,158.31        2.53
-----

Best of 5 results (stemming time = elapsed time minus the no-stemmer baseline):

No Stemmer:              1.47s
Original Porter Stemmer: 1.69s - 1.47s = 0.22s
Porter2/English Stemmer: 2.36s - 1.47s = 0.89s

Throughput increase: (2.36s-1.69s)/1.69s * 100 = 40%

Steve
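
P.S. For anyone who wants to redo the arithmetic with their own timings, here is a minimal sketch in plain Java (no Lucene dependency). The class and method names are made up for illustration and are not part of lucene/benchmark; the inputs in main() are the best-of-5 elapsed times reported above.

```java
public class StemmerOverhead {

    /** Seconds attributable to the stemmer: elapsed time minus the no-stemmer baseline. */
    public static double stemCost(double elapsedSec, double baselineSec) {
        return elapsedSec - baselineSec;
    }

    /**
     * Percentage throughput gain of the faster pipeline over the slower one.
     * Both runs process the same number of tokens, so the token count cancels
     * and the gain reduces to (slower - faster) / faster.
     */
    public static double throughputGainPercent(double slowerSec, double fasterSec) {
        return (slowerSec - fasterSec) / fasterSec * 100.0;
    }

    public static void main(String[] args) {
        // Best-of-5 elapsed times in seconds, taken from the results table.
        double noStemmer = 1.47;
        double porter = 1.69;
        double porter2 = 2.36;

        double porterCost = stemCost(porter, noStemmer);
        double porter2Cost = stemCost(porter2, noStemmer);

        System.out.printf("Porter stemming cost:  %.2fs%n", porterCost);
        System.out.printf("Porter2 stemming cost: %.2fs%n", porter2Cost);
        System.out.printf("Stemmer-only speed ratio: %.1fx%n", porter2Cost / porterCost);
        System.out.printf("Full-pipeline throughput gain: %.0f%%%n",
                throughputGainPercent(porter2, porter));
    }
}
```

The ratio compares only the time added by each stemmer, while the throughput gain compares the full analysis pipelines, which is why the first number is so much larger than the second.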