[jira] [Resolved] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

Steve Rowe (JIRA) Tue, 09 Dec 2014 10:09:34 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Rowe resolved LUCENE-6103.
--------------------------------
    Resolution: Not a Problem
      Assignee: Steve Rowe

StandardTokenizer implements [the word boundary rules in Unicode 
UAX#29|http://www.unicode.org/reports/tr29/#Word_Boundaries].

The ASCII colon (and other colon-like characters) is included in the set of 
characters matched by the 
[{{WordBreak:MidLetter}}|http://www.unicode.org/reports/tr29/#MidLetter] 
property value, which appears in [rules WB6 and 
WB7|http://www.unicode.org/reports/tr29/#WB6]. These rules forbid a word break 
on either side of the colon when it has a letter directly before and after it.

To get what you want, you could customize the JFlex grammar used to generate 
StandardTokenizer by removing colons from the {{MidLetter}} definition used.
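
To make the effect of the {{MidLetter}} set concrete, here is a minimal, 
self-contained sketch in plain Java. It is not Lucene's implementation and not 
the JFlex grammar: the class {{MidLetterSketch}} and its {{tokenize}} method are 
hypothetical names, and the logic is a heavily simplified reading of WB6/WB7 
that only knows about letters and a caller-supplied {{MidLetter}} set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; StandardTokenizer implements far more of UAX#29.
public class MidLetterSketch {

    /**
     * Splits text into runs of letters. A character from midLetter stays inside
     * a token only when a letter sits directly on both sides of it (the idea
     * behind WB6/WB7). Every other non-letter acts as a separator.
     */
    public static List<String> tokenize(String text, Set<Character> midLetter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean joins = midLetter.contains(c)
                    && i > 0 && Character.isLetter(text.charAt(i - 1))
                    && i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetter(c) || joins) {
                current.append(c);
            } else if (current.length() > 0) {
                // Separator reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "word word:word";
        // Colon in the MidLetter set: no break around it.
        System.out.println(tokenize(text, Collections.singleton(':')));
        // Colon removed from the MidLetter set: it now separates the words.
        System.out.println(tokenize(text, Collections.emptySet()));
    }
}
```

With the colon in the set, the sketch prints {{[word, word:word]}}; with the 
colon removed it prints {{[word, word, word]}}, which is the behavior the 
customized grammar is meant to produce.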

Another alternative is ICUTokenizer, which allows runtime 
per-orthographic-script specification of word-break rules - check out the 
factory javadocs: 
http://lucene.apache.org/core/4_9_0/analyzers-icu/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerFactory.html
 
> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and, as a result, most default analyzers) will not tokenize 
> word:word and will preserve it as one token. This can be easily seen using 
> Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic 
> behind it.
> If not, I'll be happy to join in the effort of fixing this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

