Ruairidh Williamson created TIKA-4315:
-----------------------------------------
Summary: XPS file parser does not emit whitespace as expected
Key: TIKA-4315
URL: https://issues.apache.org/jira/browse/TIKA-4315
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.9.2, 2.9.1
Reporter: Ruairidh Williamson
Attachments: testXLSX.xps
We are using Tika to extract text from XPS files and have hit an issue where
whitespace is not emitted where we would expect it. In the attached example
file, there is a large visual gap between "x" and "abcde1234f" when the file is
opened, but when Tika extracts the text it calls `characters` with "x" and then
`characters` with "abcde1234f". We would expect an `ignorableWhitespace` call
between those two calls, but we don't get one.
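To make the difference concrete, here is a minimal sketch of a SAX handler that logs the text callbacks in the order it receives them. `SaxEventRecorder` and its helper methods are hypothetical names for illustration only and are not part of Tika. The two sequences are replayed by hand from the description above rather than produced by parsing the attachment, and the single-space gap is an assumption, since the exact whitespace we would expect is not pinned down here.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records text-related SAX callbacks in order.
class SaxEventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters(\"" + new String(ch, start, length) + "\")");
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        events.add("ignorableWhitespace(length=" + length + ")");
    }

    List<String> events() {
        return events;
    }

    // Convenience wrappers that replay a callback by hand.
    private void text(String s) {
        characters(s.toCharArray(), 0, s.length());
    }

    private void gap(String s) {
        ignorableWhitespace(s.toCharArray(), 0, s.length());
    }

    // The sequence described in this report: two text runs, nothing in between.
    static List<String> observedSequence() {
        SaxEventRecorder r = new SaxEventRecorder();
        r.text("x");
        r.text("abcde1234f");
        return r.events();
    }

    // The sequence we would expect. The single space is an assumption.
    static List<String> expectedSequence() {
        SaxEventRecorder r = new SaxEventRecorder();
        r.text("x");
        r.gap(" ");
        r.text("abcde1234f");
        return r.events();
    }

    public static void main(String[] args) {
        System.out.println("observed: " + observedSequence());
        System.out.println("expected: " + expectedSequence());
    }
}
```

To check against the real file, a handler like this could be passed as the `ContentHandler` argument to the parser's `parse(InputStream, ContentHandler, Metadata, ParseContext)` method. With the affected versions we would expect it to record only the two `characters` calls.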
I have a pull request that fixes the issue, which I will submit.