On 11/15/2012 1:06 PM, Tom Burton-West wrote:
This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (*Krovetz*, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Con
One other factor to keep in mind is that the customer should never "look" at
the actual stem term - such as "countri" or "gener" because it can freak
them out a little, for no good reason. I mean, the goal of stemming is to
show what set of words/terms will be treated as equivalent on a query, a
Hi,
Doing String-based constant score range queries is slow, and the problem you see
is caused by the fact that a range query between 2 *string* terms finds all
terms between the 2 endpoints. As there may be ipv6 addresses with the same
string prefix, they are in the range.
So:
1. Use NumericFi
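
To illustrate the point about string ranges, here is a minimal sketch in plain Python (not Lucene code; the sample addresses and helper names are made up for illustration). It shows how a lexicographic range between two string endpoints can also capture a term that merely shares a prefix, while a comparison on the parsed numeric value does not:

```python
import ipaddress

def in_string_range(term, low, high):
    # Lexicographic comparison, which is how a range over *string* terms behaves.
    return low <= term <= high

def in_numeric_range(term, low, high):
    # Compare the parsed integer value of each address instead of its text.
    def to_int(s):
        return int(ipaddress.ip_address(s))
    return to_int(low) <= to_int(term) <= to_int(high)

# Hypothetical endpoints and candidate (documentation prefix 2001:db8::/32).
low, high = "2001:db8::1", "2001:db8::5"
candidate = "2001:db8::10"  # numerically 0x10, i.e. outside the range 1..5

string_hit = in_string_range(candidate, low, high)    # matched as a string
numeric_hit = in_numeric_range(candidate, low, high)  # not matched as a number
print(string_hit, numeric_hit)
```

The candidate sorts between the two endpoints as text because it shares their prefix, which is the effect described above; comparing numeric values avoids it.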
Hi all
I have a problem that might be very trivial, but I don't know how to solve it
using Lucene.
I created an index with Lucene for a huge data set of around 3 million documents
in various domains and another index for a corpus of 30 documents in a specific
domain. For every document in the smal
Thanks for the suggestions. I think Erick is correct as well. I'll let the
customer decide.
Here's an updated list. Fyi--the minStem was the English Minimal Stemmer--I
changed the label. Interesting to see where the minimal stemmer and porter
agree (and KStemmer doesn't). You may also find t
We have been reading that there are new flexible indexing capabilities in
Lucene 4.0. This seems very promising and useful for what we're trying to do,
but we can't find documentation on exactly how to implement something.
Here's our problem setting: we're trying to incorporate attributes onto
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.
All stemmers both overstem and understem. Understemming means that some
forms of a word won’t get searched. For example, without stemming, searching
for “dogs” would
I'd make it easy for myself. Generate (programmatically), a list like you
showed for a _lot_ more terms, send it to your customer, and let _them_
pick. Unfortunately, the customer has no idea what "aggressive" means (for
that matter, I don't know how porter handles specific words for that
matter, I
Oddly I had the exact same thought. Although it's not obvious from the name
(and common usage) of trim-like functions that you'd also have a way to
specify maximum length (after trimming I'd assume).
And the other thought I had was that TrimFilter should optionally take a
list of characters to tri