[ https://issues.apache.org/jira/browse/TIKA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922764#comment-17922764 ]
Tim Allison commented on TIKA-4375:
-----------------------------------

We can add non-breaking spaces here: https://github.com/apache/tika/blob/main/tika-eval/tika-eval-core/src/main/resources/lucene-char-mapping.txt

We're using the UAX URL tokenizer, which clearly isn't tokenizing on those.

> Regression tests for 2.9.3 release
> ----------------------------------
>
>                 Key: TIKA-4375
>                 URL: https://issues.apache.org/jira/browse/TIKA-4375
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: LTWA2JGVJGJ5RVKHTUX6SDS4NTL5UJVQ-p139.pdf, tika-2.9.2-v-tika-2.9.3-reports.tgz
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
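
To illustrate the idea in the comment above (mapping non-breaking spaces to ordinary spaces before tokenization), here is a minimal, self-contained sketch in plain Java. It is not tika-eval's code and does not use Lucene: the class name `NbspMappingSketch`, the `CHAR_MAP` entries, and the simple space-splitting `tokenize` helper are all hypothetical stand-ins, and the actual contents and syntax of lucene-char-mapping.txt are not reproduced here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class NbspMappingSketch {
    // Hypothetical mapping table (an assumption for illustration only):
    // each non-breaking space variant is rewritten to an ordinary space.
    static final Map<Character, Character> CHAR_MAP = Map.of(
        '\u00A0', ' ',  // NO-BREAK SPACE
        '\u202F', ' ',  // NARROW NO-BREAK SPACE
        '\u2007', ' '   // FIGURE SPACE
    );

    // Apply the character mapping before the text reaches the tokenizer.
    static String applyMapping(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = CHAR_MAP.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    // Stand-in tokenizer: splits on ASCII spaces only, so a non-breaking
    // space left in the input keeps two words glued into one token.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split(" +"));
    }

    public static void main(String[] args) {
        String raw = "regression\u00A0tests";
        System.out.println(tokenize(raw).size());               // one token
        System.out.println(tokenize(applyMapping(raw)).size()); // two tokens
    }
}
```

In a Lucene analysis chain the same effect would normally come from a character filter placed in front of the tokenizer rather than from hand-written code like this; the sketch only shows why the mapping changes the token count.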