Mike,
Here's the screenshot; not sure if it will go through as an attachment
though - if not, I'll post it as a link. Please ignore the altered package
names, since Lucene is shaded in as part of our build process.
Some more context about the use case. Yes, the terms are pretty much
unique; the s
Thanks for the Google link. I wasn't aware of it. Most of it is very
intuitive and, most importantly, consistent.
On Wed, Aug 27, 2014 at 11:07 AM, Jack Krupansky
wrote:
> It's not documented, but Google does seem to support trailing wildcard,
> but only if the prefix has at least six charact
It's not documented, but Google does seem to support trailing wildcard, but
only if the prefix has at least six characters. For shorter prefixes, it
seems to just drop the wildcard.
Google also uses "*" in quoted phrases to mean a placeholder for any single
term. That's documented.
See:
http
Thanks Jack. I'll try this out. I'll have to see if that creates other
side effects :-(. Tokenization is already causing a great deal of
confusion. I want to make it as intuitive as possible.
On Wed, Aug 27, 2014 at 10:45 AM, Jack Krupansky
wrote:
> Yes, the white space tokenizer will pres
Yes. If you search Google for alphare and for alphare*, you get two
different results. Sorry for the contrived example. I just tried searching
for alpharetta and went backwards, deleting characters.
On Wed, Aug 27, 2014 at 10:01 AM, Benson Margulies
wrote:
> Does google actually support "*"?
>
>
Yes, the white space tokenizer will preserve all punctuation, but... then
the query for DevNm00* will fail. A "smarter" set of filters is probably
needed here... start with white space tokenization, keep that overall token,
then trim external punctuation and keep that token as well, and then use
Tokenization is tricky. You might consider using whitespace tokenizer
followed by word delimiter filter (instead of standard tokenizer); it
does a kind of secondary tokenization pass that can preserve the
original token in addition to its component parts. There are some weird
side effects to
Does google actually support "*"?
On Wed, Aug 27, 2014 at 9:54 AM, Milind wrote:
> I see. This is going to be extremely difficult to explain to end users.
> It doesn't work as they would expect. Some of the tokenizing rules are
> already somewhat confusing. Their expectation is that it shou
I see. This is going to be extremely difficult to explain to end users.
It doesn't work as they would expect. Some of the tokenizing rules are
already somewhat confusing. Their expectation is that it should work the
way their searches work in Google.
It's difficult enough to recognize that beca
This is surprising: unless you have an excessive number of unique
fields, BlockTreeTermReader shouldn't be such a big RAM consumer.
But you only have 12 unique fields?
Can you post screen shots of the heap usage?
Mike McCandless
http://blog.mikemccandless.com
On Tue, Aug 26, 2014 at 3:53 PM, V