[jira] [Resolved] (LUCENE-6348) Incorrect results from UAX_URL_EMAIL tokenizer

Steve Rowe (JIRA) Fri, 06 Mar 2015 15:00:06 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Steve Rowe resolved LUCENE-6348.
--------------------------------
    Resolution: Not a Problem
      Assignee: Steve Rowe

Hi Benji,

This is the intended behavior.

Before LUCENE-5897/LUCENE-5400 were committed in Lucene 4.9.1, tokenization 
rules could match tokens of any length, and tokens longer than 
max_token_length were silently dropped rather than truncated.

From Lucene 4.9.1 onward, StandardTokenizer and UAX29URLEmailTokenizer rules 
are not allowed to match against more than max_token_length characters, so URL 
prefixes will match, but the non-matched remaining characters of the URL will 
be subject to all of the other tokenization rules, resulting in behavior like 
you're seeing.

To get the behavior you want, increase the max_token_length to the maximum 
token length you expect to encounter, then add a TruncateTokenFilter, set to 
truncate tokens to your current max_token_length. 
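
As an illustration only, the three behaviors described above (drop, split, 
and raise-the-limit-then-truncate) can be sketched with a toy model. This is 
not Lucene's grammar: the two regular expressions and the names tokenize, 
truncate, capped, and TEXT are assumptions made for this sketch, and the 
value 255 is an arbitrary "large enough" limit.

```python
import re

# Toy stand-ins for the real grammar rules (assumption): a URL is
# "http(s)://" plus non-whitespace, a word is a run of ASCII letters/digits.
URL = re.compile(r"https?://\S+")
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(text, max_len, capped):
    """Return tokens under a simplified model of either behavior.

    capped=False: rules match any length; overlong tokens are dropped.
    capped=True:  rules may match at most max_len characters; whatever is
                  left over is tokenized again by the same rules.
    """
    tokens, pos = [], 0
    while pos < len(text):
        end = min(len(text), pos + max_len) if capped else len(text)
        match = URL.match(text, pos, end) or WORD.match(text, pos, end)
        if match is None:
            pos += 1  # skip separators such as spaces and punctuation
            continue
        if len(match.group()) <= max_len:
            tokens.append(match.group())
        pos = match.end()
    return tokens


def truncate(tokens, length):
    """Model of a truncating token filter: keep the first `length` chars."""
    return [t[:length] for t in tokens]


TEXT = ("hey, check out http://edge.org/conversation/"
        "yuval_noah_harari-daniel_kahneman-death-is-optional"
        " for some light reading.")

dropped = tokenize(TEXT, 64, capped=False)                   # old behavior
split = tokenize(TEXT, 64, capped=True)                      # 4.9.1 onward
truncated = truncate(tokenize(TEXT, 255, capped=True), 64)   # suggested fix

for name, result in (("dropped", dropped), ("split", split),
                     ("truncated", truncated)):
    print(name, result)
```

In this model the second call reproduces the "eath", "is", "optional" terms 
from the report, while the third call keeps the 64-character URL prefix and 
produces no terms from the rest of the URL.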

> Incorrect results from UAX_URL_EMAIL tokenizer
> ----------------------------------------------
>
>                 Key: LUCENE-6348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6348
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>         Environment: Elasticsearch 1.3.4 on Ubuntu 14.04.2
>            Reporter: Benji Smith
>            Assignee: Steve Rowe
>
> I'm using an analyzer based on the UAX_URL_EMAIL tokenizer, with a maximum 
> token length of 64 characters. I expect the analyzer to discard any text in 
> the URL beyond those 64 characters, but the actual results yield ordinary 
> terms from the tail-end of the URL.
> For example, 
> {code}
> curl -XGET 'http://localhost:9200/my_index/_analyze?analyzer=uax_url_email_analyzer' \
>   -d "hey, check out http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional for some light reading."
> {code}
> The results look like this:
> {code}
> {
>     "tokens": [
>         {
>             "token": "hey",
>             "start_offset": 0,
>             "end_offset": 3,
>             "type": "<ALPHANUM>",
>             "position": 1
>         },
>         {
>             "token": "check",
>             "start_offset": 5,
>             "end_offset": 10,
>             "type": "<ALPHANUM>",
>             "position": 2
>         },
>         {
>             "token": "out",
>             "start_offset": 11,
>             "end_offset": 14,
>             "type": "<ALPHANUM>",
>             "position": 3
>         },
>         {
>             "token": "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d",
>             "start_offset": 15,
>             "end_offset": 79,
>             "type": "<URL>",
>             "position": 4
>         },
>         {
>             "token": "eath",
>             "start_offset": 79,
>             "end_offset": 83,
>             "type": "<ALPHANUM>",
>             "position": 5
>         },
>         {
>             "token": "is",
>             "start_offset": 84,
>             "end_offset": 86,
>             "type": "<ALPHANUM>",
>             "position": 6
>         },
>         {
>             "token": "optional",
>             "start_offset": 87,
>             "end_offset": 95,
>             "type": "<ALPHANUM>",
>             "position": 7
>         },
>         {
>             "token": "for",
>             "start_offset": 96,
>             "end_offset": 99,
>             "type": "<ALPHANUM>",
>             "position": 8
>         },
>         {
>             "token": "some",
>             "start_offset": 100,
>             "end_offset": 104,
>             "type": "<ALPHANUM>",
>             "position": 9
>         },
>         {
>             "token": "light",
>             "start_offset": 105,
>             "end_offset": 110,
>             "type": "<ALPHANUM>",
>             "position": 10
>         },
>         {
>             "token": "reading",
>             "start_offset": 111,
>             "end_offset": 118,
>             "type": "<ALPHANUM>",
>             "position": 11
>         }
>     ]
> }
> {code}
> The term from the extracted URL is correct, and correctly truncated at 64 
> characters. But as you can see, the analysis pipeline also creates three 
> spurious terms [ "eath", "is", "optional" ] which come from the discarded 
> portion of the URL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

