> i think you want to adjust the end offset of the first output token,
> and the start offset of the second.
Makes sense. Thanks so much.

After thinking about this a bit more it seems I should think of the
contents of a Token's termBuffer simply as an index (or key) into the
region of text defined by the token's starting and ending offsets. A
token's termBuffer, then, might have no (lexical) relation whatsoever
with the region of text it references (e.g. SynonymTokenFilter).

Is that off the mark?

On Fri, Nov 13, 2009 at 9:01 PM, Robert Muir <rcm...@gmail.com> wrote:
> Babak if your filter splits a token into two output tokens,
> i think you want to adjust the end offset of the first output token,
> and the start offset of the second.
>
> Babak, for a fairly simple example of this, you can look at the
> ThaiWordFilter in the lucene contrib-analyzers package.
>
> it has to break input tokens into subtokens and correct offsets... sounds
> like you are on the right track though.
>
> On Fri, Nov 13, 2009 at 10:30 PM, Babak Farhang <farh...@gmail.com> wrote:
>
>> Thanks for your explanations. I think I have a basic understanding now.
>>
>> What I'm not so sure about, now, is how to decide on the start and
>> ending offsets when the TokenFilter implementation wants to break an
>> input token into subtokens. Should the offsets of the emitted
>> subtokens be the same as the original input token? Should I only have
>> highlighting in mind when setting these offsets, or are there other
>> things to consider (e.g. impact on search)?
>>
>> I'll check out some of the contrib filters and Solr's
>> WordDelimiterFilter to see how they handle this. But if you know any
>> rules of thumb I should follow please share..
>>
>> -Babak
>>
>> PS Hope this kind of follow-up question is not considered bad etiquette.
>>
>> On Fri, Nov 13, 2009 at 4:20 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > Another example is if you used a stemmer, it might change the termLength:
>> > (walking -> walk), but the offsets of the original unstemmed word (walking)
>> > stay the same.
>> >
>> > On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>> >
>> >> This is not coupled because:
>> >>
>> >> termLength() is the number of chars in the term buffer, where the offsets
>> >> give the offsets in the original char stream. If you use a CharFilter to
>> >> e.g. remove chars, the termLength will get shorter, but the offsets are
>> >> still the original ones. Also both things are indexed in different ways,
>> >> the termLength and offsets have no relation and need not (as said before)
>> >> even follow a contract like end-start=length.
>> >>
>> >> -----
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >>
>> >> > -----Original Message-----
>> >> > From: Babak Farhang [mailto:farh...@gmail.com]
>> >> > Sent: Friday, November 13, 2009 11:50 PM
>> >> > To: java-user@lucene.apache.org
>> >> > Subject: Redundant fields Token class?
>> >> >
>> >> > I'm writing a TokenFilter and am confused about why class Token has
>> >> > both an *endOffset* and a *termLength* field. It would appear that
>> >> > the following invariant should always hold for a Token instance:
>> >> >
>> >> > termLength() == endOffset() - startOffset()
>> >> >
>> >> > If so, then
>> >> >
>> >> > 1) Why 2 fields, instead of 1?
>> >> > 2) Why isn't the invariant enforced in the class?
>> >> >
>> >> > -Babak
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com
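
For reference, here is a minimal sketch of the two cases discussed in this thread, written in plain Java so it runs without Lucene on the classpath. Everything in it is an assumption made for illustration: `Tok`, `splitAt`, and `stem` are hypothetical stand-ins, not Lucene's `Token` class or any real filter, and the "stemmer" only strips a trailing "ing". It shows a split where the first subtoken's end offset and the second subtoken's start offset are adjusted, and a stemmed token whose term length no longer equals end minus start.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; Tok is a hypothetical stand-in, NOT Lucene's Token. */
public class OffsetSketch {

    /** A term plus the offsets of the source-text region it refers to. */
    public static final class Tok {
        public final String term;
        public final int startOffset; // inclusive, in the original text
        public final int endOffset;   // exclusive, in the original text

        public Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        public int termLength() {
            return term.length();
        }
    }

    /**
     * Splits a token into two subtokens at index `at` of its term.
     * Assumes the term is still the verbatim source text, so term
     * positions map one-to-one onto offsets.
     */
    public static List<Tok> splitAt(Tok in, int at) {
        int mid = in.startOffset + at;
        List<Tok> out = new ArrayList<>();
        // First subtoken: same start, end offset pulled back to the split point.
        out.add(new Tok(in.term.substring(0, at), in.startOffset, mid));
        // Second subtoken: start offset moved up to the split point, same end.
        out.add(new Tok(in.term.substring(at), mid, in.endOffset));
        return out;
    }

    /** Toy stemmer: drops a trailing "ing" but leaves the offsets untouched. */
    public static Tok stem(Tok in) {
        String t = in.term.endsWith("ing")
                ? in.term.substring(0, in.term.length() - 3)
                : in.term;
        return new Tok(t, in.startOffset, in.endOffset);
    }

    public static void main(String[] args) {
        String text = "keep walking";
        Tok stemmed = stem(new Tok("walking", 5, 12));
        // The term shrank, but the offsets still cover the unstemmed word.
        System.out.println(stemmed.term + " / termLength=" + stemmed.termLength()
                + " / region=" + text.substring(stemmed.startOffset, stemmed.endOffset));

        for (Tok t : splitAt(new Tok("foobar", 0, 6), 3)) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

In the split case each subtoken's offsets cover only its own part of the input, which is what a highlighter would need. In the stemmed case the term acts as a key for a region it no longer matches character for character.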