Babak, if your filter splits a token into two output tokens, I think you want to adjust the end offset of the first output token and the start offset of the second.
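A minimal sketch of that adjustment, written in plain Java rather than against the Lucene API. The names `Tok` and `split` are hypothetical, and the sketch assumes each character of the term maps to exactly one character of the original text (no CharFilter or stemmer upstream):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: term text plus offsets into the original text.
final class Tok {
    final String term;
    final int startOffset; // inclusive, in the original char stream
    final int endOffset;   // exclusive, in the original char stream

    Tok(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class SplitOffsetsSketch {
    // Splits one token into two at index splitAt of its term.
    // Assumption: term characters correspond 1:1 to original characters,
    // so the boundary offset is simply startOffset + splitAt.
    static List<Tok> split(Tok tok, int splitAt) {
        int boundary = tok.startOffset + splitAt;
        List<Tok> out = new ArrayList<>();
        // The first subtoken keeps the original start; its end moves to the boundary.
        out.add(new Tok(tok.term.substring(0, splitAt), tok.startOffset, boundary));
        // The second subtoken starts at the boundary and keeps the original end.
        out.add(new Tok(tok.term.substring(splitAt), boundary, tok.endOffset));
        return out;
    }

    public static void main(String[] args) {
        // "wordbreak" found at offsets 10..19 of some original text
        for (Tok t : split(new Tok("wordbreak", 10, 19), 4)) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
        // prints:
        // word [10,14)
        // break [14,19)
    }
}
```

With offsets set this way, a highlighter would mark only the characters of the matching subtoken instead of the whole input token.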
Babak, for a fairly simple example of this, you can look at the ThaiWordFilter in the Lucene contrib-analyzers package. It has to break input tokens into subtokens and correct the offsets... sounds like you are on the right track, though.

On Fri, Nov 13, 2009 at 10:30 PM, Babak Farhang <farh...@gmail.com> wrote:

> Thanks for your explanations. I think I have a basic understanding now.
>
> What I'm not so sure about, now, is how to decide on the start and
> ending offsets when the TokenFilter implementation wants to break an
> input token into subtokens. Should the offsets of the emitted
> subtokens be the same as the original input token? Should I only have
> highlighting in mind when setting these offsets, or are there other
> things to consider (e.g. impact on search)?
>
> I'll check out some of the contrib filters and Solr's
> WordDelimiterFilter to see how they handle this. But if you know any
> rules of thumb I should follow please share..
>
> -Babak
>
> PS Hope this kind of follow-up question is not considered bad etiquette.
>
> On Fri, Nov 13, 2009 at 4:20 PM, Robert Muir <rcm...@gmail.com> wrote:
> > Another example is if you used a stemmer: it might change the termLength
> > (walking -> walk), but the offsets of the original unstemmed word
> > (walking) stay the same.
> >
> > On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> >> This is not coupled because:
> >>
> >> termLength() is the number of chars in the term buffer, whereas the
> >> offsets give the positions in the original char stream. If you use a
> >> CharFilter to, e.g., remove chars, the termLength will get shorter, but
> >> the offsets are still the original ones. Also, the two are indexed in
> >> different ways; the termLength and offsets have no relation and (as said
> >> before) do not even have to follow a contract like end-start=length.
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >> > -----Original Message-----
> >> > From: Babak Farhang [mailto:farh...@gmail.com]
> >> > Sent: Friday, November 13, 2009 11:50 PM
> >> > To: java-user@lucene.apache.org
> >> > Subject: Redundant fields Token class?
> >> >
> >> > I'm writing a TokenFilter and am confused about why class Token has
> >> > both an *endOffset* and a *termLength* field. It would appear that
> >> > the following invariant should always hold for a Token instance:
> >> >
> >> > termLength() == endOffset() - startOffset()
> >> >
> >> > If so, then
> >> >
> >> > 1) Why 2 fields, instead of 1?
> >> > 2) Why isn't the invariant enforced in the class?
> >> >
> >> > -Babak
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> > --
> > Robert Muir
> > rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
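
The point made in the quoted messages, that the term length and the offsets are independent, can be illustrated with a small plain-Java sketch. Nothing here is the Lucene API: `OffsetVsLengthSketch`, its nested `Tok` class, and the toy `stem` method (which only strips a trailing "ing") are hypothetical names chosen for illustration.

```java
// Sketch: a filter may shorten the term while the offsets keep pointing
// at the original, unstemmed word in the source text.
final class OffsetVsLengthSketch {

    // Hypothetical stand-in for a token; not Lucene's Token class.
    static final class Tok {
        final String term;
        final int startOffset; // inclusive, in the original char stream
        final int endOffset;   // exclusive, in the original char stream

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        // Number of chars in the (possibly modified) term.
        int termLength() {
            return term.length();
        }
    }

    // Toy stemmer: strips a trailing "ing". Not a real stemming algorithm.
    static Tok stem(Tok in) {
        String t = in.term.endsWith("ing")
                ? in.term.substring(0, in.term.length() - 3)
                : in.term;
        // Offsets are copied unchanged: they still cover the original word.
        return new Tok(t, in.startOffset, in.endOffset);
    }

    public static void main(String[] args) {
        Tok stemmed = stem(new Tok("walking", 0, 7));
        System.out.println(stemmed.term + " termLength=" + stemmed.termLength()
                + " offsetSpan=" + (stemmed.endOffset - stemmed.startOffset));
        // prints: walk termLength=4 offsetSpan=7
    }
}
```

Here termLength() is 4 while endOffset - startOffset is 7, so the invariant proposed in the original question does not hold once a filter rewrites the term.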