In Japanese, the components of a compound are just substrings of the input string. In other languages, decompounding can manufacture entire tokens from thin air, and then it's an open question how to assign the offsets. I think that you're right, ultimately, insofar as there's always some offset in the original text that might as well be blamed for any given component.
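To make that concrete, here is a rough sketch in plain Java of the layout Robert describes below: first component, then the full compound stacked at the same position with posinc=0, then the remaining components. It is not the Lucene API. `CompoundTokens`, `Tok`, and `decompound` are names made up for illustration, and the fields only mirror what PositionIncrementAttribute, PositionLengthAttribute, and OffsetAttribute would carry. The fallback of blaming every component on the whole compound's span when exact offsets are unknown is my assumption about one reasonable policy, not something Lucene prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CompoundTokens {

    /** Plain stand-in for one token in the stream; hypothetical, not a Lucene class. */
    static final class Tok {
        final String term;
        final int posInc;      // position increment relative to the previous token
        final int posLength;   // how many positions this token spans
        final int startOffset; // character offsets into the original text
        final int endOffset;

        Tok(String term, int posInc, int posLength, int startOffset, int endOffset) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + "(posinc=" + posInc + ",posLength=" + posLength
                    + ",offsets=" + startOffset + "-" + endOffset + ")";
        }
    }

    /**
     * Emits the first component, then the whole compound stacked on it
     * (posinc=0, posLength = number of components), then the other components.
     *
     * If exactOffsets is true, the parts are assumed to be consecutive
     * substrings of the compound and each gets its own slice. Otherwise
     * (assumption) every part is blamed on the whole compound's span.
     */
    static List<Tok> decompound(String compound, int start, List<String> parts, boolean exactOffsets) {
        int end = start + compound.length();
        List<Tok> out = new ArrayList<>();
        int cursor = start;
        for (int i = 0; i < parts.size(); i++) {
            String part = parts.get(i);
            int s = exactOffsets ? cursor : start;
            int e = exactOffsets ? cursor + part.length() : end;
            out.add(new Tok(part, 1, 1, s, e));
            cursor += part.length();
            if (i == 0) {
                // The compound "sits" at the first component's position.
                out.add(new Tok(compound, 0, parts.size(), start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Components are literal substrings, so exact offsets are known.
        System.out.println(decompound("ABCD", 0, Arrays.asList("AB", "CD"), true));
        // German linking "s": "Liebe" is not a substring of "Liebesbrief",
        // so both components fall back to the compound's offsets.
        System.out.println(decompound("Liebesbrief", 0, Arrays.asList("Liebe", "Brief"), false));
    }
}
```

In the second call the positions still come out exactly as in the ABCD case; only the offsets get coarser, which mostly matters for highlighting.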
On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies <ben...@basistech.com> wrote:
>> On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir <rcm...@gmail.com> wrote:
>>> its the latter. the way its designed to work i think is illustrated
>>> best in kuromoji analyzer where it heuristically decompounds nouns:
>>>
>>> if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
>>> these both have posinc=1.
>>> however (to compensate for precision issue you mentioned on the other
>>> thread), it keeps the full compound as a synonym too (there are some
>>> papers benchmarking this approach for decompounding, just think of IDF
>>> etc sorting things out).
>>> so that ABCD synonym has position increment 0, and it "sits" at the
>>> same position as the first token (AB). but it has positionLength=2,
>>> which basically keeps the information in the chain that this "synonym"
>>> spans across both AB and CD.
>>>
>>> so the output is like this: AB(posinc=1,posLength=1),
>>> ABCD(posinc=0,posLength=2), CD(posinc=1, posLength=1)
>>
>> I suppose this works best if you actually know the offsets of the
>> pieces. In disassembling German, this is not always straightforward.
>>
>
> i dont really see how it has anything to do with natural languages?
> its just the way you represent the compound components in the
> tokenstream.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org