On Fri, Sep 6, 2013 at 8:03 PM, Benson Margulies <ben...@basistech.com> wrote:
> I'm confused by the comment about compound components here.
>
> If a single token fissions into multiple tokens, then what belongs in
> the PositionLengthAttribute. I'm wanting to store a fraction in here!
> Or is the idea to store N in the 'mother' token and then '1' in each
> of the babies?

It's the latter.

The way it's designed to work is, I think, best illustrated by the Kuromoji analyzer, which heuristically decompounds nouns. If it decompounds ABCD into AB + CD, then the tokens are AB and CD, and both have posinc=1.

However (to compensate for the precision issue you mentioned on the other thread), it also keeps the full compound as a synonym. There are some papers benchmarking this approach to decompounding; just think of IDF etc. sorting things out. That ABCD synonym has position increment 0, so it "sits" at the same position as the first token (AB). But it has positionLength=2, which keeps the information in the chain that this "synonym" spans both AB and CD.

So the output looks like this:

AB(posinc=1, posLength=1), ABCD(posinc=0, posLength=2), CD(posinc=1, posLength=1)
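
To make the arithmetic concrete, here is a minimal sketch in plain Python of how those two attributes turn into start and end positions. It does not use Lucene at all: the Token class and the spans() helper are hypothetical names made up for this illustration, and the only assumption carried over is the usual convention that the first token with posinc=1 lands on position 0.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a token; not a Lucene class."""
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_length: int   # number of positions this token spans


def spans(tokens):
    """Return (term, start, end) per token, with end exclusive."""
    position = -1  # so that the first token with pos_inc=1 sits at position 0
    result = []
    for tok in tokens:
        position += tok.pos_inc
        result.append((tok.term, position, position + tok.pos_length))
    return result


# The token stream described above for ABCD decompounded into AB + CD.
stream = [
    Token("AB", pos_inc=1, pos_length=1),
    Token("ABCD", pos_inc=0, pos_length=2),
    Token("CD", pos_inc=1, pos_length=1),
]

for term, start, end in spans(stream):
    print(f"{term}: positions {start}..{end}")
```

Under those assumptions AB covers position 0, CD covers position 1, and ABCD starts at position 0 with its end lining up with the end of CD, which is the "spans both" information that the position length carries.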