Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-27 Thread Vitaly Funstein
Mike, Here's the screenshot; not sure if it will go through as an attachment though - if not, I'll post it as a link. Please ignore the altered package names, since Lucene is shaded in as part of our build process. Some more context about the use case. Yes, the terms are pretty much unique; the s

Re: Why does this search fail?

2014-08-27 Thread Milind
Thanks for the Google link. I wasn't aware of it. Most of it is very intuitive. And most importantly consistent. On Wed, Aug 27, 2014 at 11:07 AM, Jack Krupansky wrote: > It's not documented, but Google does seem to support trailing wildcard, > but only if the prefix has at least six charact

Re: Why does this search fail?

2014-08-27 Thread Jack Krupansky
It's not documented, but Google does seem to support trailing wildcard, but only if the prefix has at least six characters. For shorter prefixes, it seems to just drop the wildcard. Google also uses "*" in quoted phrases to mean a placeholder for any single term. That's documented. See: http

Re: Why does this search fail?

2014-08-27 Thread Milind
Thanks Jack. I'll try this out. I'll have to see if that creates other side effects :-(. Tokenization is already causing a great deal of confusion. I want to make it as intuitive as possible. On Wed, Aug 27, 2014 at 10:45 AM, Jack Krupansky wrote: > Yes, the white space tokenizer will pres

Re: Why does this search fail?

2014-08-27 Thread Milind
Yes. If you search for alphare on google and alphare*, you get 2 different results. Sorry for the contrived example. I just tried searching for alpharetta and went backwards deleting characters. On Wed, Aug 27, 2014 at 10:01 AM, Benson Margulies wrote: > Does google actually support "*"? > >

Re: Why does this search fail?

2014-08-27 Thread Jack Krupansky
Yes, the white space tokenizer will preserve all punctuation, but... then the query for DevNm00* will fail. A "smarter" set of filters is probably needed here... start with white space tokenization, keep that overall token, then trim external punctuation and keep that token as well, and then use
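The "keep the whole token, then also keep a copy with external punctuation trimmed" idea can be sketched without any Lucene machinery. The sketch below is a stdlib-only illustration of the approach, not a real Lucene `Tokenizer` or `TokenFilter`; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the suggestion above: split on whitespace, keep each
// raw token verbatim, and additionally keep a copy with leading/trailing
// punctuation stripped. Internal punctuation (e.g. the "-" in DevNm00-01)
// is left intact.
public class PunctuationAwareTokenizer {

    public static List<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) continue;
            tokens.add(raw); // keep the overall token, punctuation and all
            String trimmed = trimExternalPunct(raw);
            if (!trimmed.isEmpty()) tokens.add(trimmed); // and the trimmed form
        }
        return new ArrayList<>(tokens);
    }

    // Strip punctuation only at the edges, leaving internal characters alone.
    static String trimExternalPunct(String s) {
        int start = 0, end = s.length();
        while (start < end && !Character.isLetterOrDigit(s.charAt(start))) start++;
        while (end > start && !Character.isLetterOrDigit(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(DevNm00-01), ready."));
        // prints [(DevNm00-01),, DevNm00-01, ready., ready]
    }
}
```

Indexing both forms is what lets a prefix query like `DevNm00*` match either the punctuated or the clean variant; in a real analysis chain you would do the same thing with stacked tokens at the same position.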

Re: Why does this search fail?

2014-08-27 Thread Michael Sokolov
Tokenization is tricky. You might consider using whitespace tokenizer followed by word delimiter filter (instead of standard tokenizer); it does a kind of secondary tokenization pass that can preserve the original token in addition to its component parts. There are some weird side effects to
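The "secondary tokenization pass that preserves the original token" behavior can be approximated in plain Java. The sketch below only imitates what Lucene's WordDelimiterFilter does with word/number parts plus a preserved original; the real filter also handles case-change splits, catenation options, and position increments, none of which are modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stdlib-only approximation of a word-delimiter pass: keep the original token
// as-is, then also emit its alphabetic and numeric runs as separate sub-tokens.
// Illustrative only -- not the actual Lucene filter.
public class WordDelimiterSketch {

    // A "part" is a run of letters or a run of digits.
    private static final Pattern PART = Pattern.compile("[A-Za-z]+|[0-9]+");

    public static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token); // preserve the original, so "DevNm00-01" stays searchable
        Matcher m = PART.matcher(token);
        List<String> parts = new ArrayList<>();
        while (m.find()) parts.add(m.group());
        // Only add parts when they differ from the original token itself.
        if (parts.size() > 1 || (!parts.isEmpty() && !parts.get(0).equals(token))) {
            out.addAll(parts);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("DevNm00-01"));
        // prints [DevNm00-01, DevNm, 00, 01]
    }
}
```

The "weird side effects" mentioned above come from exactly this expansion: one input token becomes several index terms, which changes phrase matching and term statistics.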

Re: Why does this search fail?

2014-08-27 Thread Benson Margulies
Does google actually support "*"? On Wed, Aug 27, 2014 at 9:54 AM, Milind wrote: > I see. This is going to be extremely difficult to explain to end users. > It doesn't work as they would expect. Some of the tokenizing rules are > already somewhat confusing. Their expectation is that it shou

Re: Why does this search fail?

2014-08-27 Thread Milind
I see. This is going to be extremely difficult to explain to end users. It doesn't work as they would expect. Some of the tokenizing rules are already somewhat confusing. Their expectation is that it should work the way their searches work in Google. It's difficult enough to recognize that beca

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-08-27 Thread Michael McCandless
This is surprising: unless you have an excessive number of unique fields, BlockTreeTermsReader shouldn't be such a big RAM consumer. But you only have 12 unique fields? Can you post screen shots of the heap usage? Mike McCandless http://blog.mikemccandless.com On Tue, Aug 26, 2014 at 3:53 PM, V