[ 
https://issues.apache.org/jira/browse/TIKA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922764#comment-17922764
 ] 

Tim Allison commented on TIKA-4375:
-----------------------------------

We can add non-breaking spaces here: 
https://github.com/apache/tika/blob/main/tika-eval/tika-eval-core/src/main/resources/lucene-char-mapping.txt
 

We're using the UAX29 URL/email tokenizer, which clearly isn't splitting tokens on non-breaking spaces.
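
To illustrate the idea, here is a minimal sketch in plain Java. It is not tika-eval's code and does not use Lucene; the class and method names (NbspMappingSketch, mapNbsp, splitOnWhitespace) are hypothetical, and a simple whitespace split stands in for the real tokenizer. It only shows why mapping NO-BREAK SPACE (U+00A0) to an ordinary space before tokenizing changes the token count. The actual fix would be an entry in the char-mapping file linked above.

```java
import java.util.Arrays;
import java.util.List;

public class NbspMappingSketch {

    // Hypothetical helper: replace NO-BREAK SPACE (U+00A0) with an ordinary
    // space, mimicking what a character-mapping step would do before tokenizing.
    static String mapNbsp(String text) {
        return text.replace('\u00A0', ' ');
    }

    // Stand-in tokenizer (an assumption, not the tokenizer tika-eval uses):
    // Java's default \s does not match U+00A0, so NBSP does not split tokens.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String raw = "quick\u00A0brown fox";

        // Without mapping, "quick" and "brown" stay glued together: 2 tokens.
        System.out.println(splitOnWhitespace(raw).size());

        // With mapping, the NBSP becomes a normal space: 3 tokens.
        System.out.println(splitOnWhitespace(mapNbsp(raw)).size());
    }
}
```

Doing the replacement as a character-level step ahead of the tokenizer keeps the tokenizer configuration itself unchanged, which is the appeal of handling this in the mapping file.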

> Regression tests for 2.9.3 release
> ----------------------------------
>
>                 Key: TIKA-4375
>                 URL: https://issues.apache.org/jira/browse/TIKA-4375
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: LTWA2JGVJGJ5RVKHTUX6SDS4NTL5UJVQ-p139.pdf, 
> tika-2.9.2-v-tika-2.9.3-reports.tgz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
