[ 
https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892466#comment-17892466
 ] 

Hudson commented on TIKA-4315:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #1851 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/1851/])
[TIKA-4315] Fix XPS whitespace not being emitted (#1970) (tallison: 
[https://github.com/apache/tika/commit/6eeeb185685a7015316791e44848bb5d68613a25])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSParserTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/test_text.xps
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSPageContentHandler.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testXLSX.xps


> XPS file parser does not emit whitespace as expected
> ----------------------------------------------------
>
>                 Key: TIKA-4315
>                 URL: https://issues.apache.org/jira/browse/TIKA-4315
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1, 2.9.2
>            Reporter: Ruairidh Williamson
>            Priority: Major
>             Fix For: 2.9.3, 3.0.1, 4.0.0
>
>         Attachments: testXLSX.xps
>
>
> We are using Tika to extract text from XPS files and have hit an issue where 
> whitespace is not emitted where we would expect it. In the attached example 
> file, there is visibly a large gap between "x" and "abcde1234f" when the file 
> is opened, but when Tika extracts the text it calls `characters` with "x" and 
> then `characters` with "abcde1234f". We would expect an `ignorableWhitespace` 
> call between those two calls, but we don't get one.
> I have a pull request that fixes the issue, which I will submit.
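
The callback sequence described in the report can be illustrated with a small SAX handler that records which text callbacks it receives. This is only a sketch using the JDK's `org.xml.sax` classes: `RecordingHandler` and `replayExpected` are hypothetical names, the single-space gap is an assumption, and the code replays the sequence the reporter expected rather than running Tika's XPS parser.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

/** Records, in order, the SAX text callbacks it receives. */
class RecordingHandler extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Store the text that was passed to characters().
        events.add("characters(" + new String(ch, start, length) + ")");
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Store only the length, since the content is whitespace.
        events.add("ignorableWhitespace(" + length + ")");
    }

    List<String> events() {
        return events;
    }

    /**
     * Hypothetical helper: replays the sequence the reporter expected for
     * the attached file. The single-space gap is an assumption.
     */
    static void replayExpected(RecordingHandler handler) {
        char[] first = "x".toCharArray();
        char[] gap = " ".toCharArray();
        char[] second = "abcde1234f".toCharArray();
        handler.characters(first, 0, first.length);
        handler.ignorableWhitespace(gap, 0, gap.length);
        handler.characters(second, 0, second.length);
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        replayExpected(handler);
        handler.events().forEach(System.out::println);
    }
}
```

To check real parser behaviour, a handler like this could be passed as the `ContentHandler` argument when parsing the attached file with Tika. With the bug as reported, the recorded list would contain the two `characters` entries with no `ignorableWhitespace` entry between them.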



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
