Re: Which stemmer?

2012-11-15 Thread Michael Sokolov
On 11/15/2012 1:06 PM, Tom Burton-West wrote: This paper on the Kstem stemmer lists cases where the Porter stemmer understems or overstems and explains the logic of Kstem: "Viewing Morphology as an Inference Process" (*Krovetz*, R., Proceedings of the Sixteenth Annual International ACM SIGIR Con

Re: Which stemmer?

2012-11-15 Thread Jack Krupansky
One other factor to keep in mind is that the customer should never "look" at the actual stem term - such as "countri" or "gener" - because it can freak them out a little, for no good reason. I mean, the goal of stemming is to show what set of words/terms will be treated as equivalent on a query, a

RE: ConstantScoreRangeQuery returns wrong results

2012-11-15 Thread Uwe Schindler
Hi, doing String-based constant score range queries is slow, and the problem you see is caused by the fact that a range query between 2 *string* terms finds all terms between the 2 endpoints. As there may be IPv6 addresses with the same string prefix, they are in the range. So: 1. Use NumericFi
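
To illustrate the point about string ranges, here is a minimal sketch in plain Python (not Lucene code). It assumes IPv4 addresses for brevity, and the helper names `string_range` and `numeric_range` are hypothetical. It shows how a lexicographic range picks up terms that merely share a string prefix, while a range over the numeric value does not.

```python
import ipaddress

def string_range(terms, lower, upper):
    # Lexicographic comparison, as a range over string terms would do.
    return [t for t in terms if lower <= t <= upper]

def numeric_range(terms, lower, upper):
    # Compare the integer value of each address instead of its text.
    lo = int(ipaddress.ip_address(lower))
    hi = int(ipaddress.ip_address(upper))
    return [t for t in terms if lo <= int(ipaddress.ip_address(t)) <= hi]

# Hypothetical sample terms, chosen so that some share a string prefix.
terms = ["10.0.0.1", "10.0.0.10", "10.0.0.5", "10.0.0.9", "10.0.0.200"]

as_strings = string_range(terms, "10.0.0.1", "10.0.0.9")
as_numbers = numeric_range(terms, "10.0.0.1", "10.0.0.9")

print(as_strings)  # includes 10.0.0.10 and 10.0.0.200
print(as_numbers)  # only the addresses numerically inside the range
```

In this sketch the string range returns all five terms, because "10.0.0.10" and "10.0.0.200" sort between the two endpoints as text.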

comparing documents in 2 indexes

2012-11-15 Thread Elshaimaa Ali
Hi all, I have a problem that might be very trivial, but I don't know how to solve it using Lucene. I created an index with Lucene for a huge data set of around 3 million documents in various domains, and another index for a corpus of 30 documents in a specific domain. For every document in the smal

RE: Which stemmer?

2012-11-15 Thread Scott Smith
Thanks for the suggestions. I think Erick is correct as well. I'll let the customer decide. Here's an updated list. FYI: the minStem was the English Minimal Stemmer; I changed the label. Interesting to see where the minimal stemmer and Porter agree (and KStemmer doesn't). You may also find t

search influenced by token attributes

2012-11-15 Thread Masanz, James J.
We have been reading that there are new flexible indexing capabilities in Lucene 4.0. This seems very promising and useful for what we're trying to do, but we can't find documentation on exactly how to implement something. Here's our problem setting: we're trying to incorporate attributes onto

Re: Which stemmer?

2012-11-15 Thread Tom Burton-West
I agree with Erick that you probably need to give your client a list of concrete examples, and perhaps to explain the trade-offs. All stemmers both overstem and understem. Understemming means that some forms of a word won’t get searched. For example, without stemming, searching for “dogs” would
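
The trade-off can be sketched with a toy example. This is not Porter or KStem: `toy_stem` is a hypothetical stemmer that only strips a trailing "s", and the word list is invented for illustration. Grouping words by stem shows which forms a query would treat as equivalent.

```python
from collections import defaultdict

def toy_stem(word):
    # Hypothetical crude stemmer: strip a trailing "s" from words of 4+ letters.
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def group_by_stem(words, stem):
    # Map each stem to the list of words that reduce to it.
    groups = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return dict(groups)

words = ["dog", "dogs", "new", "news"]

no_stemming = group_by_stem(words, lambda w: w)
with_toy = group_by_stem(words, toy_stem)

print(no_stemming)
print(with_toy)
```

Without stemming, "dog" and "dogs" land in separate groups, so a search for one misses the other. With the toy stemmer they merge, but so do "new" and "news", which is the kind of unwanted conflation overstemming refers to.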

Re: Which stemmer?

2012-11-15 Thread Erick Erickson
I'd make it easy for myself. Generate (programmatically) a list like you showed for a _lot_ more terms, send it to your customer, and let _them_ pick. Unfortunately, the customer has no idea what "aggressive" means (for that matter, I don't know how Porter handles specific words, I
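
A rough sketch of generating such a list programmatically follows. The stemmers here are hypothetical stand-ins (plain Python callables), not the Lucene Porter, KStem, or minimal stemmers; in practice each callable would wrap a real analyzer. `stem_table` is an assumed helper name.

```python
def stem_table(words, stemmers):
    # Build a plain-text table: one row per word, one column per stemmer.
    names = list(stemmers)
    rows = [["word"] + names]
    for w in words:
        rows.append([w] + [stemmers[name](w) for name in names])
    # Pad each column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)

# Hypothetical stand-in stemmers for illustration only.
stemmers = {
    "none": lambda w: w,
    "strip_s": lambda w: w[:-1] if w.endswith("s") and len(w) > 3 else w,
}

table = stem_table(["countries", "general", "dogs"], stemmers)
print(table)
```

Feeding a much longer word list through the same function gives the customer concrete rows to compare, rather than an abstract notion of "aggressive".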

Re: content disappears in the index

2012-11-15 Thread Erick Erickson
Oddly, I had the exact same thought, although it's not obvious from the name (and common usage) of trim-like functions that you'd also have a way to specify maximum length (after trimming, I'd assume). And the other thought I had was that TrimFilter should optionally take a list of characters to tri