Ruairidh Williamson created TIKA-4315:
-----------------------------------------

             Summary: XPS file parser does not emit whitespace as expected
                 Key: TIKA-4315
                 URL: https://issues.apache.org/jira/browse/TIKA-4315
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.9.2, 2.9.1
            Reporter: Ruairidh Williamson
         Attachments: testXLSX.xps

We are using Tika to extract text from XPS files and have hit an issue where 
whitespace is not emitted where we would expect it. In the attached example 
file, there is a large visual gap between "x" and "abcde1234f" when the file is 
opened, but when Tika extracts the text it calls `characters` with "x" and then 
`characters` with "abcde1234f". We would expect an `ignorableWhitespace` call 
between those two calls, but we do not get one.
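
To make the two event sequences concrete, here is a minimal sketch that uses only the JDK's SAX classes. `EventRecorder` and `replay` are hypothetical names chosen for illustration and are not part of Tika. The events are replayed by hand to mirror the description above, not produced by the XPS parser, and the single space used for the gap is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records the SAX text callbacks it receives.
public class EventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters:" + new String(ch, start, length));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        events.add("ignorableWhitespace:" + new String(ch, start, length));
    }

    public List<String> events() {
        return events;
    }

    // Replays the calls described in this report by hand.
    // emitGap=false mirrors what we observe; emitGap=true mirrors what we expect.
    public static List<String> replay(boolean emitGap) {
        EventRecorder recorder = new EventRecorder();
        char[] first = "x".toCharArray();
        char[] gap = " ".toCharArray(); // assumed: a single space stands in for the visual gap
        char[] second = "abcde1234f".toCharArray();

        recorder.characters(first, 0, first.length);
        if (emitGap) {
            recorder.ignorableWhitespace(gap, 0, gap.length);
        }
        recorder.characters(second, 0, second.length);
        return recorder.events();
    }

    public static void main(String[] args) {
        System.out.println("observed: " + replay(false));
        System.out.println("expected: " + replay(true));
    }
}
```

Because the handler extends `DefaultHandler`, it is an ordinary SAX `ContentHandler`, so the same recorder could in principle be handed to a parser to capture the real callbacks for the attached file.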

I have a fix for the issue and will submit it as a pull request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
