On Mar 28, 2007, at 3:51 AM, Antony Bowesman wrote:
I'm fiddling with custom analyzers to analyze email addresses, storing
the full email address as well as its component parts. It's based
on Solr's analyzer framework, so I have a StandardTokenizerFactory
followed by an EmailFilterFactory. It produces:
Analyzing "<humphrey.bogart@casablanca.com>"
1: [humphrey.bogart@casablanca.com:1->31:<EMAIL>]
2: [humphrey:1->9:<EMAIL>]
3: [bogart:10->16:<EMAIL>]
4: [casablanca:17->27:<EMAIL>]
5: [com:28->31:<EMAIL>]
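(A rough sketch of a filter that could produce output like the above, written
against the 2.1-era Token-based TokenStream API. The class name, the <EMAIL>
type check and the splitting logic are illustrative assumptions, not the actual
EmailFilterFactory code:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Emits each <EMAIL> token unchanged, followed by one token per
 * component part, with start/end offsets pointing at the part's own
 * position in the original text.
 */
public class EmailFilter extends TokenFilter {

  private final LinkedList<Token> parts = new LinkedList<Token>();

  public EmailFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    // Drain any component tokens queued from the previous email token.
    if (!parts.isEmpty()) {
      return parts.removeFirst();
    }

    Token token = input.next();
    if (token == null || !"<EMAIL>".equals(token.type())) {
      return token;
    }

    // Split on '@' and '.' and queue one token per part, computing each
    // part's own offsets relative to the full token's start offset.
    String text = token.termText();
    int base = token.startOffset();
    int partStart = 0;
    for (int i = 0; i <= text.length(); i++) {
      if (i == text.length() || text.charAt(i) == '@' || text.charAt(i) == '.') {
        if (i > partStart) {
          parts.add(new Token(text.substring(partStart, i),
                              base + partStart, base + i, "<EMAIL>"));
          // call setPositionIncrement(0) on the part here instead to stack
          // the components at the same position as the full address
        }
        partStart = i + 1;
      }
    }
    // Return the full address first; the queued parts follow.
    return token;
  }
}

The corresponding Solr factory would then simply return new EmailFilter(input)
from its create() method.)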
I set the start/end offsets to each component's actual position and
length in the original text, but LIA listing 4.6 shows the synonym
tokens keeping the same start/end offsets as the original token.
LIA says these offsets are not used by Lucene - is that still the
case for 2.1, and does it matter?
They aren't used implicitly by anything in Lucene, but they can be very
handy for efficient highlighting. Where you set the offsets really
depends on how you plan to use them. In the synonym example you
mention, if the original word is "dog" and the user searched for
"canine", then to properly highlight the word "dog" in the original
text, the offsets for "canine" need to be where "dog" is.
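(A minimal sketch of that idea against the 2.1-era Token API; the words and
offsets below are made up for illustration:

import org.apache.lucene.analysis.Token;

public class SynonymOffsetExample {
  public static void main(String[] args) {
    // Hypothetical original token: "dog" at offsets 10-13 in the source text.
    Token original = new Token("dog", 10, 13);

    // The injected synonym keeps the original token's offsets so that a
    // query hit on "canine" highlights the span where "dog" appears.
    Token synonym = new Token("canine",
                              original.startOffset(),
                              original.endOffset());
    synonym.setPositionIncrement(0);  // stack it at the same position as "dog"

    System.out.println(synonym.termText() + ":"
        + synonym.startOffset() + "->" + synonym.endOffset());
    // prints: canine:10->13
  }
}

With the original's offsets carried on the synonym, a highlighter working from
offsets marks the right span even though the matched term never appears in the
text.)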
Erik