I'm fiddling with custom analyzers for email addresses, so that both the full email address and its component parts are stored. It's based on Solr's analyzer framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory. It produces:

Analyzing "<[EMAIL PROTECTED]>"

1: [EMAIL PROTECTED]:1->31:<EMAIL>]
2: [humphrey:1->9:<EMAIL>]
3: [bogart:10->16:<EMAIL>]
4: [casablanca:17->27:<EMAIL>]
5: [com:28->31:<EMAIL>]

I set the start/end offsets of each component to the position of that part within the input. Listing 4.6 in the LIA book, however, shows the synonyms with the same start/end offsets as the original token.
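To make the two conventions concrete, here is a minimal sketch in plain Java. It deliberately uses no Lucene or Solr classes: `Tok`, `split` and the sample address are hypothetical names chosen for illustration, and this is not the actual EmailFilterFactory code. With `ownOffsets` set to true each part gets the offsets of its own characters (the approach in the output above); with false each part reuses the offsets of the whole address (the synonym-style approach).

```java
import java.util.ArrayList;
import java.util.List;

public class EmailOffsets {

    /** Hypothetical stand-in for a token: text plus start/end offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + ":" + start + "->" + end + "]";
        }
    }

    /**
     * Emits the whole address, then each part separated by '.' or '@'.
     * 'start' is the offset of the address within the original input.
     * ownOffsets == true:  each part carries its own start/end.
     * ownOffsets == false: each part reuses the whole address's start/end.
     */
    static List<Tok> split(String email, int start, boolean ownOffsets) {
        List<Tok> out = new ArrayList<>();
        int end = start + email.length();
        out.add(new Tok(email, start, end));

        int partStart = 0;
        for (int i = 0; i <= email.length(); i++) {
            boolean atSeparator = i == email.length()
                    || email.charAt(i) == '.' || email.charAt(i) == '@';
            if (!atSeparator) {
                continue;
            }
            if (i > partStart) { // skip empty parts between adjacent separators
                String part = email.substring(partStart, i);
                out.add(ownOffsets
                        ? new Tok(part, start + partStart, start + i)
                        : new Tok(part, start, end));
            }
            partStart = i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<first.last@example.com>"; // made-up address
        String email = input.substring(1, input.length() - 1);
        // The address starts at offset 1 because of the leading '<'.
        System.out.println("own offsets:    " + split(email, 1, true));
        System.out.println("shared offsets: " + split(email, 1, false));
    }
}
```

Running `main` prints both token lists side by side, so the only difference between the two conventions is the start/end pair attached to each part.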

LIA says the offsets are not used by Lucene itself. Is that still the case for 2.1, and does it matter which of the two I use?

Thanks
Antony



