On Mar 28, 2007, at 3:51 AM, Antony Bowesman wrote:

I'm fiddling with custom analyzers to analyze email addresses, storing both the full email address and its component parts. It's based on Solr's analyzer framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory. It produces

Analyzing "<[EMAIL PROTECTED]>"

1: [[EMAIL PROTECTED]:1->31:<EMAIL>]
2: [humphrey:1->9:<EMAIL>]
3: [bogart:10->16:<EMAIL>]
4: [casablanca:17->27:<EMAIL>]
5: [com:28->31:<EMAIL>]

I set the start/end offsets of each component to that component's own position and length within the input. Listing 4.6 in the LIA (Lucene in Action) book, however, shows the synonyms with the same start/end offsets as the original token.

LIA says these are not used in Lucene - is that still the case for 2.1 and does this matter?

They aren't used implicitly by anything in Lucene, but can be very handy for efficient highlighting. Where you set the offsets depends entirely on how you plan to use the offset values. In the synonym example you mention, if the original word is "dog" and the user searched for "canine", the offsets for "canine" need to be where "dog" is in order to properly highlight the word "dog" in the original text.
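
To make the trade-off concrete, here is a rough sketch in plain Python rather than the Lucene API. The `Token` tuple and the `highlight` helper are hypothetical names invented for illustration, and the address in the second example is only an assumed one whose layout matches the offsets listed in the question (the real one is redacted). It shows a synonym sharing the original token's offsets, and per-component offsets letting a highlighter mark just one part of an address.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for an analyzer token (not a Lucene class)."""
    term: str
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character


def highlight(text: str, tokens: List[Token], query_term: str) -> str:
    """Wrap each span of text whose token term equals query_term in <b> tags."""
    # Collect distinct matching spans and process them left to right.
    spans = sorted({(t.start, t.end) for t in tokens if t.term == query_term})
    out, pos = [], 0
    for start, end in spans:
        if start < pos:
            continue  # skip a span that overlaps one already emitted
        out.append(text[pos:start])
        out.append("<b>" + text[start:end] + "</b>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Synonym case: "canine" is injected with the same offsets as "dog".
synonym_text = "my dog barks"
synonym_tokens = [
    Token("my", 0, 2),
    Token("dog", 3, 6),
    Token("canine", 3, 6),  # synonym shares the original token's offsets
    Token("barks", 7, 12),
]
synonym_result = highlight(synonym_text, synonym_tokens, "canine")

# Email case: each component carries its own offsets, as in the question.
email_text = "<humphrey.bogart@casablanca.com>"  # assumed address
email_tokens = [
    Token("humphrey.bogart@casablanca.com", 1, 31),
    Token("humphrey", 1, 9),
    Token("bogart", 10, 16),
    Token("casablanca", 17, 27),
    Token("com", 28, 31),
]
email_result = highlight(email_text, email_tokens, "bogart")

print(synonym_result)
print(email_result)
```

With per-component offsets, a search for "bogart" marks only that part of the address. Had every component been given the full token's offsets (1->31), the same search would mark the whole address instead. Either choice can be valid, depending on what the highlighting should show.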

        Erik

