[ 
https://issues.apache.org/jira/browse/TIKA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922759#comment-17922759
 ] 

Tilman Hausherr edited comment on TIKA-4375 at 1/31/25 4:13 PM:
----------------------------------------------------------------

Although I wrote elsewhere not to bother with PDF files, I did look at one, 
[^LTWA2JGVJGJ5RVKHTUX6SDS4NTL5UJVQ-p139.pdf]. It has "narrow no-break 
space" (U+202F) and "thin space" (U+2009) characters within words, so we get 
tokens like "dezember 2014". I'm not sure whether this is a PDFBox bug, a 
Tika bug, or not a bug at all.
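
For anyone who wants to check extracted tokens for these two characters, here is a minimal sketch. It is not taken from Tika or PDFBox; the class name SpecialSpaceCheck, the method findSpecialSpaces, and the sample token (which assumes the space in "dezember 2014" is U+202F) are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or PDFBox.
public class SpecialSpaceCheck {

    // Reports each NARROW NO-BREAK SPACE (U+202F) or THIN SPACE (U+2009)
    // found in the token, together with its index.
    public static List<String> findSpecialSpaces(String token) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\u202F') {
                hits.add("U+202F NARROW NO-BREAK SPACE at index " + i);
            } else if (c == '\u2009') {
                hits.add("U+2009 THIN SPACE at index " + i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample: the space inside the token is taken to be U+202F.
        String token = "dezember\u202F2014";
        System.out.println(findSpecialSpaces(token));
    }
}
```

Running this over the tokens in the comparison reports would show which of the two characters is actually present, which might help narrow down where it comes from.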


> Regression tests for 2.9.3 release
> ----------------------------------
>
>                 Key: TIKA-4375
>                 URL: https://issues.apache.org/jira/browse/TIKA-4375
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: LTWA2JGVJGJ5RVKHTUX6SDS4NTL5UJVQ-p139.pdf, 
> tika-2.9.2-v-tika-2.9.3-reports.tgz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)