[
https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892409#comment-17892409
]
ASF GitHub Bot commented on TIKA-4315:
--------------------------------------
ruwi-next commented on PR #1970:
URL: https://github.com/apache/tika/pull/1970#issuecomment-2434755713
I've had a go at implementing this; the changes are in
[c42466f](https://github.com/apache/tika/pull/1970/commits/c42466ff72f1470935ee01d792ef4d10d8c67f87).
I can squash into one commit if that is preferred.
The implementation uses the indices string that is provided in XPS, which
lists information for each glyph in a run. The useful piece is the advance,
which according to
https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index
is measured in 1/100 em. Using this we can calculate the distance between runs
and decide, based on a threshold, whether whitespace should be inserted. I have
added some test XPS files that I made to test this.
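To illustrate the idea, here is a minimal sketch of the calculation, not the code from the PR. The class name `GlyphRunSpacing`, its method names, and the 0.3 em threshold are hypothetical and chosen only for the example. It assumes each entry of the indices string is separated by `;` and has the advance as its second comma-separated field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or the PR.
final class GlyphRunSpacing {

    // Assumed threshold: a gap wider than this fraction of an em counts as whitespace.
    static final double GAP_THRESHOLD_EM = 0.3;

    /**
     * Extracts the advance of each glyph from an indices string.
     * Entries are separated by ';' and look like "GlyphIndex,Advance,uOffset,vOffset";
     * a missing advance is returned as null.
     */
    static List<Double> parseAdvances(String indices) {
        List<Double> advances = new ArrayList<>();
        if (indices == null || indices.isEmpty()) {
            return advances;
        }
        // A limit of -1 keeps empty entries so positions still match glyphs.
        for (String entry : indices.split(";", -1)) {
            String[] fields = entry.split(",", -1);
            if (fields.length > 1 && !fields[1].trim().isEmpty()) {
                advances.add(Double.parseDouble(fields[1].trim()));
            } else {
                advances.add(null);
            }
        }
        return advances;
    }

    /** Sums the known advances (1/100 em) and scales them by the font size. */
    static double knownWidth(List<Double> advances, double fontSize) {
        double hundredthsOfEm = 0;
        for (Double advance : advances) {
            if (advance != null) {
                hundredthsOfEm += advance;
            }
        }
        return hundredthsOfEm / 100.0 * fontSize;
    }

    /** Returns true if the gap between two runs exceeds the assumed threshold. */
    static boolean needsWhitespace(double runEndX, double nextRunStartX, double fontSize) {
        return nextRunStartX - runEndX > GAP_THRESHOLD_EM * fontSize;
    }
}
```

With a font size of 10, the string `20,50;30,60;40` yields a known width of 11 units; a following run starting at x = 30 would then be separated by whitespace, while one starting at x = 12 would not.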
This implementation has some assumptions and limitations, mainly that we do
not get the glyph advance value for the last glyph in a run. I have used the
average advance or 0.5 as a fallback in this case.
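As a rough sketch of that fallback (again not the PR code): the class `LastGlyphAdvance` and its method are hypothetical, and the sketch assumes that 0.5 means 0.5 em and is used only when no other advance in the run is known.

```java
// Hypothetical helper for illustration only; not part of Tika or the PR.
final class LastGlyphAdvance {

    // Assumption: the 0.5 fallback is expressed in em.
    static final double DEFAULT_ADVANCE_EM = 0.5;

    /**
     * Estimates the advance of the last glyph, in em, from the advances
     * (in 1/100 em) that are known for the other glyphs in the run.
     */
    static double estimateEm(double[] knownAdvancesHundredths) {
        if (knownAdvancesHundredths.length == 0) {
            return DEFAULT_ADVANCE_EM;
        }
        double sum = 0;
        for (double advance : knownAdvancesHundredths) {
            sum += advance;
        }
        return sum / knownAdvancesHundredths.length / 100.0;
    }
}
```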
It also sorts the runs as LTR unless all runs in a row are RTL. This may be
incorrect for cases where a row contains an LTR run and multiple RTL runs, but
I am not knowledgeable in this area.
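The ordering rule could be sketched roughly as follows. `RowOrdering` and `Run` are hypothetical names for this example, and the sketch assumes each run carries its x origin and a direction flag; it does not attempt proper bidirectional reordering for mixed rows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or the PR.
final class RowOrdering {

    /** A run of glyphs with its x origin and direction. */
    static final class Run {
        final String text;
        final double originX;
        final boolean rtl;

        Run(String text, double originX, boolean rtl) {
            this.text = text;
            this.originX = originX;
            this.rtl = rtl;
        }
    }

    /**
     * Returns the runs of one row in reading order: left to right by default,
     * right to left only when every run in the row is RTL.
     */
    static List<Run> sortRow(List<Run> row) {
        List<Run> sorted = new ArrayList<>(row);
        boolean allRtl = !sorted.isEmpty() && sorted.stream().allMatch(r -> r.rtl);
        Comparator<Run> byX = Comparator.comparingDouble(r -> r.originX);
        sorted.sort(allRtl ? byX.reversed() : byX);
        return sorted;
    }
}
```

A row that mixes LTR and RTL runs falls through to the left-to-right ordering here, which is exactly the case flagged above as possibly incorrect.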
Any feedback is appreciated :)
> XPS file parser does not emit whitespace as expected
> ----------------------------------------------------
>
> Key: TIKA-4315
> URL: https://issues.apache.org/jira/browse/TIKA-4315
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.1, 2.9.2
> Reporter: Ruairidh Williamson
> Priority: Major
> Attachments: testXLSX.xps
>
>
> We are using Tika to extract text from XPS files and have hit an issue where
> whitespace is not emitted where we would expect. See the attached example
> file: when opened, it visually has a large gap between "x" and
> "abcde1234f", but when extracted by Tika it calls `characters` with "x" and
> then `characters` with "abcde1234f". We would expect an `ignorableWhitespace`
> call in between those calls, but we don't get one.
> I have a pull request that fixes the issue which I will submit.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)