Start/end offsets in analyzers

Antony Bowesman Tue, 27 Mar 2007 23:52:02 -0800

I'm fiddling with custom analyzers to analyze email addresses, storing both the full email address and its component parts. It's based on Solr's analyzer framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory. It produces:


Analyzing "<[EMAIL PROTECTED]>"


1: [EMAIL PROTECTED]:1->31:<EMAIL>]
2: [humphrey:1->9:<EMAIL>]
3: [bogart:10->16:<EMAIL>]
4: [casablanca:17->27:<EMAIL>]
5: [com:28->31:<EMAIL>]

Listing 4.6 in the LIA book shows the start/end offsets for the synonyms as the same as those of the original token, whereas I set the start/end offsets of each part to that part's actual position and length within the input.
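To make the two conventions concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: `EmailOffsets`, `Tok` and `split` are hypothetical names rather than Lucene or Solr APIs, and the address is a made-up input chosen to be consistent with the token list above (the real one is redacted in the archive). With `perPartOffsets` set to true each part gets its own span; with false each part inherits the span of the full address, as the synonyms do in the LIA listing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: Tok stands in for a token with offsets.
class EmailOffsets {
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + ":" + start + "->" + end + "]";
        }
    }

    // Emits the full address followed by its parts, split on '.' and '@'.
    // base is the offset of the address within the original input.
    static List<Tok> split(String address, int base, boolean perPartOffsets) {
        List<Tok> out = new ArrayList<>();
        int fullStart = base;
        int fullEnd = base + address.length();
        out.add(new Tok(address, fullStart, fullEnd));

        int partStart = 0;
        for (int i = 0; i <= address.length(); i++) {
            boolean atEnd = (i == address.length());
            if (atEnd || address.charAt(i) == '.' || address.charAt(i) == '@') {
                if (i > partStart) {
                    String part = address.substring(partStart, i);
                    out.add(perPartOffsets
                            ? new Tok(part, base + partStart, base + i) // span of this part only
                            : new Tok(part, fullStart, fullEnd));       // span of the whole address
                }
                partStart = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input; the leading '<' occupies offset 0, so the address starts at 1.
        String address = "humphrey.bogart@casablanca.com";
        System.out.println(split(address, 1, true));
        System.out.println(split(address, 1, false));
    }
}
```

In per-part mode the sketch reproduces the offsets I listed above (1->9, 10->16, 17->27, 28->31); in the other mode every part reports 1->31.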

LIA says these offsets are not used in Lucene. Is that still the case for 2.1, and does this matter?


Thanks
Antony



