RE: EmailAddressAnalyzer & TokenStreams

Steven A Rowe Wed, 20 Aug 2008 16:22:30 -0700

Hi Dino,

The Lucene KeywordTokenizer is about as simple as tokenizers get - it just 
outputs its entire input as a single token:


<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/KeywordTokenizer.java?revision=687357&view=markup>
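As a rough illustration of that behaviour (not the Lucene source), here is a minimal sketch in plain Java with no Lucene dependency. The class name WholeInputTokenizer and its next() method are hypothetical stand-ins. A real Tokenizer subclass would return Token objects through the TokenStream API instead of a String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a keyword-style tokenizer; NOT a Lucene class.
final class WholeInputTokenizer {
    private final Reader input;
    private boolean done = false;

    WholeInputTokenizer(Reader input) {
        this.input = input;
    }

    /** Returns the entire input as one token on the first call, then null. */
    String next() {
        if (done) {
            return null; // stream exhausted: only one token is ever produced
        }
        done = true;
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            // Drain the reader completely; everything read becomes the single token.
            while ((n = input.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Calling next() on an instance wrapping a StringReader yields the full string once and null afterwards, which is the same "one token, then end of stream" shape a custom tokenizer would follow.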

Check out the source code for other Tokenizer descendants in the Lucene source 
for more hints.  Warning: a few of them are generated by scanner generator 
tools (JavaCC and JFlex), so the code is a bit impenetrable in places.

To set the position for a Token, call its setPositionIncrement() method.  From 
the javadocs:

    Set the position increment.  This determines the position of
    this token relative to the previous Token in a TokenStream,
    used in phrase searching.

(Read the rest of the javadoc for that method.  Go on, you know you want to.)
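
To make the arithmetic concrete, here is a small plain-Java sketch (no Lucene dependency) of how position increments translate into absolute positions: each token's position is the previous position plus its increment, so an increment of 1 moves to the next position and an increment of 0 stacks a token on the same position as the one before it. PositionDemo and positions() are hypothetical names, and the increments in the usage note are just one possible assignment, not a recommendation.

```java
// Hypothetical helper for illustration; NOT part of Lucene.
final class PositionDemo {
    /**
     * Converts per-token position increments into absolute positions.
     * Counting starts before the first token, so a first increment
     * of 1 places that token at position 0.
     */
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i]; // 0 keeps the token at the previous position
            out[i] = pos;
        }
        return out;
    }
}
```

For example, assuming the tokens [John] [Smith] [london.gb.world.net] [gb.world.net] [world.net] were given increments 1, 1, 1, 0, 0, positions() would place them at 0, 1, 2, 2, 2, so the three domain variants would share a single position.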

Good luck,
Steve

On 08/20/2008 at 12:58 PM, Dino Korah wrote:
> Hi guys,
> 
> If I am to tokenize an email address like "John Smith" <[EMAIL PROTECTED]>
> into
> 
>     [[EMAIL PROTECTED]]
>     [John] [Smith] [J.Smith] [london.gb.world.net] [gb.world.net]
>     [world.net] [world] [net]
> 
> Is it possible to have a different Position increment for each of these
> tokens? If it is, could you please help me with the same sample, with
> numbers against each token.
> 
> Also could you please point me to a skeleton code for a custom Tokenizer.
> 
> Many Thanks
> Dino

