Re: Start/end offsets in analyzers

Erik Hatcher Wed, 28 Mar 2007 04:41:01 -0800


On Mar 28, 2007, at 3:51 AM, Antony Bowesman wrote:

I'm fiddling with custom analyzers to analyze email addresses to store the full email address and the component parts. It's based on Solr's analyzer framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory. It produces
Analyzing "<[EMAIL PROTECTED]>"

1: [EMAIL PROTECTED]:1->31:<EMAIL>]
2: [humphrey:1->9:<EMAIL>]
3: [bogart:10->16:<EMAIL>]
4: [casablanca:17->27:<EMAIL>]
5: [com:28->31:<EMAIL>]
I set the start/end offset to be the length of the component, but listing 4.6 in the LIA book shows the start/end offsets for the synonyms as the same as the original token, whereas I set my start/end as the correct start/end for the length and offset of the part.
LIA says these are not used in Lucene - is that still the case for 2.1 and does this matter?

They aren't used implicitly by anything in Lucene, but can be very handy for efficient highlighting. Where you set the offsets really all depends on how you plan on using the offset values. In the synonym example you mention, if the original word is "dog" and the user searched for "canine", to properly highlight the word "dog" in the original text the offsets for "canine" need to be where "dog" is.
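
To make the synonym case concrete, here is a rough sketch in plain Java. It does not use Lucene's classes: Tok, analyze and highlight are hypothetical names made up for illustration, and the "analyzer" is just a toy that splits on spaces and injects "canine" as a synonym of "dog". The only point is that the synonym carries the same start/end offsets as the original word, so a hit on "canine" can be mapped back onto "dog" in the stored text.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightSketch {

    /** Hypothetical stand-in for an analyzer token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int start; // inclusive offset into the original text
        final int end;   // exclusive offset into the original text

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy analysis: split on single spaces; emit "canine" at the same offsets as "dog". */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        for (String word : text.split(" ")) {
            int start = text.indexOf(word, pos);
            int end = start + word.length();
            out.add(new Tok(word, start, end));
            if (word.equals("dog")) {
                // The synonym reuses the offsets of the word it stands for.
                out.add(new Tok("canine", start, end));
            }
            pos = end;
        }
        return out;
    }

    /** Wraps the original text covered by each token matching queryTerm in <b> tags. */
    static String highlight(String text, List<Tok> tokens, String queryTerm) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Tok t : tokens) {
            if (t.term.equals(queryTerm) && t.start >= last) {
                sb.append(text, last, t.start)
                  .append("<b>").append(text, t.start, t.end).append("</b>");
                last = t.end;
            }
        }
        sb.append(text.substring(last));
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "the dog barked";
        // Searching for the synonym still highlights the word actually in the text.
        System.out.println(highlight(text, analyze(text), "canine"));
    }
}
```

If the synonym had been given offsets of its own (say, based on the length of "canine"), the highlighter would mark the wrong span of the original text. For your email parts the reverse reasoning applies: if you want only the matched component highlighted, per-component offsets like yours are the ones to keep.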


        Erik


