[ 
https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891487#comment-17891487
 ] 

ASF GitHub Bot commented on TIKA-4315:
--------------------------------------

THausherr commented on PR #1970:
URL: https://github.com/apache/tika/pull/1970#issuecomment-2426278282

   The PDF implementation can be found in PDFTextStripper.java in the PDFBox 
project.
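
   As a rough illustration of the position-based idea behind that pointer, 
here is a minimal sketch that inserts a space when the horizontal gap between 
two text runs exceeds a threshold. This is not the PDFBox code, and 
PDFTextStripper is considerably more involved; the names `Run`, `join` and 
`gapThreshold` are hypothetical and exist only for this example.

```java
import java.util.List;

public class GapWhitespaceSketch {

    /** Hypothetical text run: its text, left edge and width in page units. */
    static final class Run {
        final String text;
        final double x;
        final double width;

        Run(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates runs left to right, adding one space wherever the gap
     * between the end of one run and the start of the next is larger than
     * gapThreshold (an assumed, caller-chosen value).
     */
    static String join(List<Run> runs, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Run prev = null;
        for (Run r : runs) {
            if (prev != null) {
                double gap = r.x - (prev.x + prev.width);
                if (gap > gapThreshold) {
                    out.append(' ');
                }
            }
            out.append(r.text);
            prev = r;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Coordinates are made up; they only mimic a wide visual gap.
        List<Run> runs = List.of(new Run("x", 0, 5), new Run("abcde1234f", 60, 50));
        System.out.println(join(runs, 2.0));
    }
}
```

   A real parser would have to derive the threshold from font metrics rather 
than take a fixed number, which is where most of the complexity lies.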




> XPS file parser does not emit whitespace as expected
> ----------------------------------------------------
>
>                 Key: TIKA-4315
>                 URL: https://issues.apache.org/jira/browse/TIKA-4315
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1, 2.9.2
>            Reporter: Ruairidh Williamson
>            Priority: Major
>         Attachments: testXLSX.xps
>
>
> We are using Tika to extract text from XPS files and have hit an issue where 
> whitespace is not emitted where we would expect. In the attached example 
> file, there is visually a large gap between "x" and "abcde1234f", but when 
> Tika extracts the text it calls `characters` with "x" and then `characters` 
> with "abcde1234f". We would expect an `ignorableWhitespace` call in between, 
> but we don't get one.
> I have a pull request that fixes the issue, which I will submit.
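
To make the reported symptom concrete, here is a minimal sketch of a SAX 
handler that collects text from both `characters` and `ignorableWhitespace`. 
It does not call Tika; the two helper methods simply replay the event 
sequences described in the report, and the names `TextCollector`, `reported` 
and `expected` are hypothetical.

```java
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceEventDemo {

    /** Collects text from both kinds of SAX text callback. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Replays the sequence from the report: two characters calls, nothing between. */
    static String reported() {
        TextCollector h = new TextCollector();
        h.characters("x".toCharArray(), 0, 1);
        h.characters("abcde1234f".toCharArray(), 0, 10);
        return h.text.toString();
    }

    /** Replays the sequence the reporter expected, with whitespace in between. */
    static String expected() {
        TextCollector h = new TextCollector();
        h.characters("x".toCharArray(), 0, 1);
        h.ignorableWhitespace(" ".toCharArray(), 0, 1);
        h.characters("abcde1234f".toCharArray(), 0, 10);
        return h.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(reported()); // xabcde1234f
        System.out.println(expected()); // x abcde1234f
    }
}
```

Without the intermediate whitespace event, a handler has no way to tell that 
the two strings were visually separated, so they run together in the output.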



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
