[
https://issues.apache.org/jira/browse/LUCENE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Rowe resolved LUCENE-6348.
--------------------------------
Resolution: Not a Problem
Assignee: Steve Rowe
Hi Benji,
This is the intended behavior.
Before LUCENE-5897/LUCENE-5400 were committed in Lucene 4.9.1, tokenization
rules could match tokens of any length, and tokens longer than
max_token_length were silently dropped rather than truncated.
From Lucene 4.9.1 onward, StandardTokenizer and UAX29URLEmailTokenizer rules
are not allowed to match against more than max_token_length characters, so URL
prefixes will match, but the non-matched remaining characters of the URL will
be subject to all of the other tokenization rules, resulting in behavior like
you're seeing.
To get the behavior you want, increase the max_token_length to the maximum
token length you expect to encounter, then add a TruncateTokenFilter, set to
truncate tokens to your current max_token_length.
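Since the report comes from Elasticsearch, here is a rough sketch of index
settings that follow this advice, built as a Python dict and printed as JSON.
The names long_uax_url_email and truncate_to_limit and the value 2048 are
made up for illustration; the "uax_url_email" tokenizer type with
"max_token_length" and the "truncate" filter type with "length" are, to my
understanding, the Elasticsearch settings that correspond to
UAX29URLEmailTokenizer and TruncateTokenFilter, but please check them against
the documentation for your version.

```python
import json

CURRENT_LIMIT = 64        # the max_token_length used in the report
LONGEST_EXPECTED = 2048   # assumption: longest token you expect to encounter

index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Hypothetical name; lets the URL rule match the whole URL.
                "long_uax_url_email": {
                    "type": "uax_url_email",
                    "max_token_length": LONGEST_EXPECTED,
                }
            },
            "filter": {
                # Hypothetical name; cuts each token down to the old limit.
                "truncate_to_limit": {
                    "type": "truncate",
                    "length": CURRENT_LIMIT,
                }
            },
            "analyzer": {
                "uax_url_email_analyzer": {
                    "type": "custom",
                    "tokenizer": "long_uax_url_email",
                    "filter": ["truncate_to_limit"],
                }
            },
        }
    }
}

# Request body to send when creating the index.
print(json.dumps(index_settings, indent=2))
```

With settings along these lines the whole URL should come out as a single
token before truncation, so no leftover characters remain for the other
rules to match.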
> Incorrect results from UAX_URL_EMAIL tokenizer
> ----------------------------------------------
>
> Key: LUCENE-6348
> URL: https://issues.apache.org/jira/browse/LUCENE-6348
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Environment: Elasticsearch 1.3.4 on Ubuntu 14.04.2
> Reporter: Benji Smith
> Assignee: Steve Rowe
>
> I'm using an analyzer based on the UAX_URL_EMAIL tokenizer, with a maximum
> token length of 64 characters. I expect the analyzer to discard any text in
> the URL beyond those 64 characters, but the actual results yield ordinary
> terms from the tail-end of the URL.
> For example,
> {code}
> curl -XGET 'http://localhost:9200/my_index/_analyze?analyzer=uax_url_email_analyzer' \
>   -d "hey, check out http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-death-is-optional for some light reading."
> {code}
> The results look like this:
> {code}
> {
> "tokens": [
> {
> "token": "hey",
> "start_offset": 0,
> "end_offset": 3,
> "type": "<ALPHANUM>",
> "position": 1
> },
> {
> "token": "check",
> "start_offset": 5,
> "end_offset": 10,
> "type": "<ALPHANUM>",
> "position": 2
> },
> {
> "token": "out",
> "start_offset": 11,
> "end_offset": 14,
> "type": "<ALPHANUM>",
> "position": 3
> },
> {
> "token": "http://edge.org/conversation/yuval_noah_harari-daniel_kahneman-d",
> "start_offset": 15,
> "end_offset": 79,
> "type": "<URL>",
> "position": 4
> },
> {
> "token": "eath",
> "start_offset": 79,
> "end_offset": 83,
> "type": "<ALPHANUM>",
> "position": 5
> },
> {
> "token": "is",
> "start_offset": 84,
> "end_offset": 86,
> "type": "<ALPHANUM>",
> "position": 6
> },
> {
> "token": "optional",
> "start_offset": 87,
> "end_offset": 95,
> "type": "<ALPHANUM>",
> "position": 7
> },
> {
> "token": "for",
> "start_offset": 96,
> "end_offset": 99,
> "type": "<ALPHANUM>",
> "position": 8
> },
> {
> "token": "some",
> "start_offset": 100,
> "end_offset": 104,
> "type": "<ALPHANUM>",
> "position": 9
> },
> {
> "token": "light",
> "start_offset": 105,
> "end_offset": 110,
> "type": "<ALPHANUM>",
> "position": 10
> },
> {
> "token": "reading",
> "start_offset": 111,
> "end_offset": 118,
> "type": "<ALPHANUM>",
> "position": 11
> }
> ]
> }
> {code}
> The term from the extracted URL is correct, and correctly truncated at 64
> characters. But as you can see, the analysis pipeline also creates three
> spurious terms [ "eath", "is", "optional" ] which come from the discarded
> portion of the URL.