On Fri, Feb 4, 2011 at 3:07 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> Lucene 2740 looks really interesting. In the meantime a JIRA issue for this
> sounds like a good idea since I'm guessing other people would like to use the
> ICUTokenizer but would also like bigrams for CJK.
>
> I'm a bit confused over the relationship of the queryparser to the filter
> chain and whether a filter in the chain after the ICUTokenizer could
> construct bigrams if the ICUTokenizer is spitting out unigrams and the
> queryparser is then converting the unigrams to a Boolean clauses (i.e.
> autoGeneratePhraseQueries=false.)

The QP only sees two things:

1. the input string, which it parses before the analyzer runs
2. the result of the entire analyzer (tokenizer and all filters)

So in this case only #2 would be different, as the entire analyzer would
output AB, BC instead of A, B, C.

With your settings, for an input of ABC, you will get a regular boolean
query with AB, BC. If the user puts "ABC" in quotes, though, you will get a
phrase query of "AB BC".

>
> If ABC is a string of Han characters and the ICUTokenizer spit out unigrams A
> B C (and we have autoGeneratePhraseQueries set to false) won't the next
> filter in the chain get each of the unigrams in a Boolean clause one at a
> time? I guess I don't see how the next filter in the chain can reassemble
> the unigrams into overlapping bigrams. Maybe I'm not understanding how
> tokens get passed from one filter to the next when one of the filters (or in
> this case the tokenizer) breaks a token up into multiple tokens.

In this case it works just like a selective ShingleFilter. The filter reads
the tokenizer's whole token stream before the QP builds any clauses, so it
sees A, B, C in sequence and can join neighbors.

>
> Or am I getting index time analysis confused with query time analysis?
> Did you mean that ICUTokenizer could be modified to output bigrams or that a
> filter could be designed that would take the output of the ICUTokenizer and
> create shingles on tokens with the attribute for Han?
>

I think the latter. This way we can provide the most options: unigram (what
it does by default: A, B, C), but also filters for bigram (AB, BC) or
unibigram (A, AB, B, BC, C).

This is why I said we can make these filters experimental for now: ideally,
at some point you will be able to use ShingleFilter "conditionally" over the
ScriptAttribute for these use cases, without needing a special filter.
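
To make the three output modes concrete, here is a rough sketch in plain
Python. It is not Lucene code and not the proposed filter, only an
illustration of the idea. The names han_bigrams and build_query, the
(text, script) tuple, and the "Han" label standing in for the ScriptAttribute
are all made up for this example. It also assumes that a lone Han unigram is
passed through unchanged, and it ignores offsets and position gaps, which a
real filter would have to respect.

```python
from typing import Iterable, List, Tuple

# Hypothetical token model: (text, script). In Lucene the script would come
# from the ScriptAttribute that ICUTokenizer sets on each token.
Token = Tuple[str, str]


def _emit_run(run: List[str], keep_unigrams: bool) -> List[str]:
    """Turn one run of adjacent Han unigrams into bigrams (or unigrams + bigrams)."""
    if len(run) == 1:
        # Assumption: a lone Han character has no neighbor, so emit it as-is.
        return [run[0]]
    out: List[str] = []
    for i, ch in enumerate(run):
        if keep_unigrams:
            out.append(ch)
        if i + 1 < len(run):
            out.append(ch + run[i + 1])  # overlapping bigram
    return out


def han_bigrams(tokens: Iterable[Token], keep_unigrams: bool = False) -> List[str]:
    """Selective shingling: only tokens tagged "Han" are joined; others pass through."""
    out: List[str] = []
    run: List[str] = []  # current run of consecutive Han unigrams
    for text, script in tokens:
        if script == "Han":
            run.append(text)
        else:
            out.extend(_emit_run(run, keep_unigrams))
            run = []
            out.append(text)
    out.extend(_emit_run(run, keep_unigrams))
    return out


def build_query(terms: List[str], quoted: bool) -> Tuple[str, List[str]]:
    """Toy stand-in for the QP: it only ever sees the analyzer's final terms."""
    return ("phrase" if quoted else "boolean", terms)
```

With tokens A, B, C all tagged "Han", han_bigrams gives AB, BC, and with
keep_unigrams=True it gives A, AB, B, BC, C. Feeding the bigram output to
build_query yields a boolean query over AB, BC for unquoted input and a
phrase over the same terms for quoted input, matching the behavior described
above.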