Re: Bigrams for CJK with ICUTokenizer ?

Robert Muir Fri, 04 Feb 2011 12:20:06 -0800

On Fri, Feb 4, 2011 at 3:07 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> Lucene 2740 looks really interesting.  In the meantime a JIRA issue for this 
> sounds like a good idea since I'm guessing other people would like to use the 
> ICUTokenizer but would also like bigrams for CJK.
>
> I'm a bit confused over the relationship of the queryparser to the filter 
> chain and whether a filter in the chain after the ICUTokenizer could 
> construct bigrams if the ICUTokenizer is spitting out unigrams and the 
> queryparser is then converting the unigrams to Boolean clauses (i.e. 
> autoGeneratePhraseQueries=false.)


The QP only sees two things:
1. the input string, which it parses before the analyzer runs
2. the output of the entire analyzer (tokenizer plus all filters).

So in this case only #2 would be different: the analyzer as a whole
would output AB, BC instead of A, B, C.
With your settings, an input of ABC gives you a regular
boolean query with AB, BC.
If the user puts "ABC" in quotes, though, you get a phrase query of "AB BC".
>
> If ABC is a string of Han characters and the ICUTokenizer spit out unigrams A 
> B C  (and we have autoGeneratePhraseQueries set to false) won't the next 
> filter in the chain get each of the unigrams in a Boolean clause one at a 
> time?  I guess I don't see how the next filter in the chain can reassemble 
> the unigrams into overlapping bigrams.   Maybe I'm not understanding how 
> tokens get passed from one filter to the next when one of the filters (or in 
> this case the tokenizer) breaks a token up into multiple tokens.

In this case it works just like a selective ShingleFilter: the filter sees
the tokenizer's whole output stream (A, B, C) before the QP builds any
clauses.
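
As an illustration of how a downstream filter can reassemble unigrams, here
is a minimal sketch in plain Python (again, not the Lucene API). Tokens are
modelled as (term, script) pairs, the "Han" and "Latin" labels stand in for
whatever script attribute the tokenizer sets, and emitting a lone Han
character as a unigram is an assumption of this sketch.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (term text, script label); labels are hypothetical


def flush(run: List[str]) -> Iterator[str]:
    """Emit overlapping bigrams for a run of adjacent Han unigrams."""
    if len(run) == 1:
        # Assumption: an isolated Han character is kept as a unigram.
        yield run[0]
    for left, right in zip(run, run[1:]):
        yield left + right


def selective_bigram_filter(tokens: Iterable[Token]) -> Iterator[str]:
    """Pull tokens from upstream one by one; shingle only the Han ones."""
    run: List[str] = []  # consecutive Han unigrams buffered so far
    for term, script in tokens:
        if script == "Han":
            run.append(term)
            continue
        # A non-Han token ends the current Han run and passes through as-is.
        yield from flush(run)
        run = []
        yield term
    yield from flush(run)


stream = [("A", "Han"), ("B", "Han"), ("C", "Han"),
          ("lucene", "Latin"), ("D", "Han")]
print(list(selective_bigram_filter(stream)))
# ['AB', 'BC', 'lucene', 'D']
```

The filter consumes the upstream tokens as a stream, so it can look at
neighbouring unigrams; the query parser is only handed what comes out at
the end.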

>
> Or am I getting index time analysis confused with query time analysis?
> Did you mean that ICUTokenizer could be modified to output bigrams  or that a 
> filter could be designed that would take the output of the ICUTokenizer and 
> create shingles on tokens with the attribute for Han?
>

I think the latter. This way we can provide the most options: unigram
(what it does by default: A, B, C), but also filters for bigram (AB, BC)
or unibigram (A, AB, B, BC, C).
This is why I said we can make these filters experimental for now:
ideally at some point you will be able to use ShingleFilter
"conditionally" over the ScriptAttribute for these use cases, without
needing a special filter.
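
For reference, the three output modes mentioned above could look roughly
like this for a run of Han unigrams. This is a plain-Python sketch under
assumptions: cjk_grams and its mode names are made up for illustration, and
the fallback to a unigram for a single character is a guess at sensible
behaviour, not something a committed filter does.

```python
from typing import List, Sequence


def cjk_grams(unigrams: Sequence[str], mode: str) -> List[str]:
    """Return the terms for one run of adjacent Han unigrams."""
    bigrams = [a + b for a, b in zip(unigrams, unigrams[1:])]
    if mode == "unigram":
        return list(unigrams)
    if mode == "bigram":
        # Assumption: a single character has no bigram, so keep the unigram.
        return bigrams or list(unigrams)
    if mode == "unibigram":
        out: List[str] = []
        for i, uni in enumerate(unigrams):
            out.append(uni)
            if i < len(bigrams):
                # Interleave the bigram that starts at this unigram.
                out.append(bigrams[i])
        return out
    raise ValueError("unknown mode: " + mode)


han = ["A", "B", "C"]
print(cjk_grams(han, "unigram"))    # ['A', 'B', 'C']
print(cjk_grams(han, "bigram"))     # ['AB', 'BC']
print(cjk_grams(han, "unibigram"))  # ['A', 'AB', 'B', 'BC', 'C']
```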

