Re: Redundant fields Token class?

Babak Farhang Fri, 13 Nov 2009 23:39:36 -0800

> I think you want to adjust the end offset of the first output token,
> and the start offset of the second.


Makes sense. Thanks so much.
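
To make the quoted advice concrete, here is a minimal sketch in plain Java. It does not use Lucene's classes: Tok and SplitOffsets are hypothetical stand-ins. It also assumes the term text is an unmodified copy of the source region, so an index into the term maps directly onto an offset in the original text. A filter that has already changed the term text would need its own mapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: term text plus offsets into the original text.
final class Tok {
    final String term;
    final int startOffset; // inclusive, relative to the original text
    final int endOffset;   // exclusive, relative to the original text

    Tok(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    @Override
    public String toString() {
        return term + " [" + startOffset + "," + endOffset + ")";
    }
}

final class SplitOffsets {
    // Splits one token into two at index splitAt of its term text.
    // The first subtoken keeps the original start offset and gets a new end offset;
    // the second gets a new start offset and keeps the original end offset.
    static List<Tok> split(Tok in, int splitAt) {
        if (splitAt <= 0 || splitAt >= in.term.length()) {
            throw new IllegalArgumentException("splitAt must fall inside the term");
        }
        int mid = in.startOffset + splitAt; // valid only under the 1:1 assumption above
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(in.term.substring(0, splitAt), in.startOffset, mid));
        out.add(new Tok(in.term.substring(splitAt), mid, in.endOffset));
        return out;
    }

    public static void main(String[] args) {
        String text = "see wifirouter now";
        Tok in = new Tok("wifirouter", 4, 14);
        for (Tok t : split(in, 4)) {
            // Each subtoken's offsets select exactly its own region of the original text.
            System.out.println(t + " -> " + text.substring(t.startOffset, t.endOffset));
        }
    }
}
```

With offsets adjusted this way, a highlighter working from the offsets marks only the subtoken that matched rather than the whole input token.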

After thinking about this a bit more, it seems I should treat the
contents of a Token's termBuffer simply as an index (or key) into the
region of text defined by the token's start and end offsets. A
token's termBuffer, then, might have no (lexical) relation whatsoever
to the region of text it references (e.g. SynonymTokenFilter).

Is that off the mark?
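
One way to picture the idea above, as a sketch only: the classes OffsetToken and TermVersusOffsets and the toy stem() method below are hypothetical and are not Lucene APIs. The sketch shows a stemmed term and a synonym term that both differ in length from the region they point at, while the offsets still select the original word.

```java
// Hypothetical stand-in for a token: the term is the indexed key,
// the offsets point back into the original text.
final class OffsetToken {
    final String term;
    final int startOffset; // inclusive
    final int endOffset;   // exclusive

    OffsetToken(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int termLength() {
        return term.length();
    }
}

final class TermVersusOffsets {
    // Toy stemmer (an assumption for illustration): strips a trailing "ing".
    // The term gets shorter; the offsets are copied unchanged.
    static OffsetToken stem(OffsetToken in) {
        String t = in.term.endsWith("ing")
                ? in.term.substring(0, in.term.length() - 3)
                : in.term;
        return new OffsetToken(t, in.startOffset, in.endOffset);
    }

    // Synonym replacement: the term changes entirely; the offsets are copied unchanged.
    static OffsetToken synonym(OffsetToken in, String replacement) {
        return new OffsetToken(replacement, in.startOffset, in.endOffset);
    }

    // Brackets the region of the original text that the token's offsets refer to.
    static String highlight(String text, OffsetToken t) {
        return text.substring(0, t.startOffset)
                + "[" + text.substring(t.startOffset, t.endOffset) + "]"
                + text.substring(t.endOffset);
    }

    public static void main(String[] args) {
        String text = "she was walking home";
        OffsetToken original = new OffsetToken("walking", 8, 15);
        OffsetToken stemmed = stem(original);
        OffsetToken syn = synonym(original, "strolling");

        // termLength() differs from endOffset - startOffset in both derived tokens.
        System.out.println(stemmed.term + " length=" + stemmed.termLength()
                + " span=" + (stemmed.endOffset - stemmed.startOffset));
        System.out.println(syn.term + " length=" + syn.termLength()
                + " span=" + (syn.endOffset - syn.startOffset));
        System.out.println(highlight(text, stemmed));
    }
}
```

In this sketch the invariant termLength() == endOffset() - startOffset() from the original question holds only for the untouched token, which is the reason both values are kept.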

On Fri, Nov 13, 2009 at 9:01 PM, Robert Muir <rcm...@gmail.com> wrote:
> Babak, if your filter splits a token into two output tokens,
> I think you want to adjust the end offset of the first output token,
> and the start offset of the second.
>
> Babak, for a fairly simple example of this, you can look at the
> ThaiWordFilter in the lucene contrib-analyzers package.
>
> It has to break input tokens into subtokens and correct offsets... sounds
> like you are on the right track though.
>
> On Fri, Nov 13, 2009 at 10:30 PM, Babak Farhang <farh...@gmail.com> wrote:
>
>> Thanks for your explanations. I think I have a basic understanding now.
>>
>> What I'm not so sure about, now, is how to decide on the start and
>> ending offsets when the TokenFilter implementation wants to break an
>> input token into subtokens. Should the offsets of the emitted
>> subtokens be the same as the original input token?  Should I only have
>> highlighting in mind when setting these offsets, or are there other
>> things to consider (e.g. impact on search)?
>>
>> I'll check out some of the contrib filters and Solr's
>> WordDelimiterFilter to see how they handle this. But if you know any
>> rules of thumb I should follow, please share.
>>
>> -Babak
>>
>> PS Hope this kind of follow-up question is not considered bad etiquette.
>>
>> On Fri, Nov 13, 2009 at 4:20 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > Another example is if you used a stemmer, it might change the termLength
>> > (walking -> walk), but the offsets of the original unstemmed word
>> > (walking) stay the same.
>> >
>> > On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>> >
>> >> This is not coupled because:
>> >>
>> >> termLength() is the number of chars in the term buffer, whereas the
>> >> offsets give the positions in the original char stream. If you use a
>> >> CharFilter to e.g. remove chars, the termLength will get shorter, but
>> >> the offsets are still the original ones. Also, both things are indexed
>> >> in different ways: the termLength and offsets have no relation and
>> >> need not (as said before) even follow a contract like end-start=length.
>> >>
>> >> -----
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >>
>> >> > -----Original Message-----
>> >> > From: Babak Farhang [mailto:farh...@gmail.com]
>> >> > Sent: Friday, November 13, 2009 11:50 PM
>> >> > To: java-user@lucene.apache.org
>> >> > Subject: Redundant fields Token class?
>> >> >
>> >> > I'm writing a TokenFilter and am confused about why class Token has
>> >> > both an *endOffset* and a *termLength* field.  It would appear that
>> >> > the following invariant should always hold for a Token instance:
>> >> >
>> >> >     termLength() == endOffset() - startOffset()
>> >> >
>> >> > If so, then
>> >> >
>> >> > 1) Why 2 fields, instead of 1?
>> >> > 2) Why isn't the invariant enforced in the class?
>> >> >
>> >> > -Babak
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>> >
>>
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
