[ https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892409#comment-17892409 ]

ASF GitHub Bot commented on TIKA-4315:
--------------------------------------

ruwi-next commented on PR #1970:
URL: https://github.com/apache/tika/pull/1970#issuecomment-2434755713

   I've had a go at implementing this. The changes are in [c42466f](https://github.com/apache/tika/pull/1970/commits/c42466ff72f1470935ee01d792ef4d10d8c67f87); I can squash into one commit if that is preferred.

   The implementation uses the indices string that is provided in XPS. It is a list of information for each glyph in a run. The useful information is the advance, which according to https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index is measured in 1/100 em. Using this we can calculate the distance between runs and decide, based on a threshold, whether a whitespace should be inserted.

   I have added some test XPS files that I made to test this.

   This implementation has some assumptions and limitations. Mainly, we do not get the glyph advance value for the last glyph in a run. I have used the average advance, or 0.5 as a fallback, in this case. It also sorts the runs as LTR unless all runs in a row are RTL. This may be incorrect for cases where a row has an LTR run and multiple RTL runs, but I am not knowledgeable in this area.

   Any feedback is appreciated :)


> XPS file parser does not emit whitespace as expected
> ----------------------------------------------------
>
>                 Key: TIKA-4315
>                 URL: https://issues.apache.org/jira/browse/TIKA-4315
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1, 2.9.2
>            Reporter: Ruairidh Williamson
>            Priority: Major
>         Attachments: testXLSX.xps
>
>
> We are using Tika to extract text from XPS files and have hit an issue where
> whitespace is not emitted where we would expect. See the attached example
> file: when opened, it visually has a large gap between "x" and
> "abcde1234f", but when extracted by Tika it calls `characters` with "x" and
> then `characters` with "abcde1234f".
> We would expect an `ignorableWhitespace` call in
> between those calls, but we don't get one.
> I have a pull request that fixes the issue which I will submit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
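
For readers following the thread, here is a rough sketch of the gap calculation described in the comment above. It is not the code from the PR. The class name, method names, the `thresholdEm` parameter and the way the fallback is applied are all assumptions made for illustration. It assumes each `;`-separated entry of the indices string has the form `glyphIndex,advance,...`, with the advance in 1/100 em and possibly missing, and that a run's start position and font em size are already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: estimates whether two glyph runs
// on the same row are far enough apart to warrant a whitespace.
final class GlyphRunSpacing {

    // Fallback advance in em, used only when a run has no advance values at all.
    static final double DEFAULT_ADVANCE_EM = 0.5;

    // Returns one advance per glyph, in em; NaN marks a glyph without an advance.
    static List<Double> parseAdvances(String indices) {
        List<Double> advances = new ArrayList<>();
        for (String entry : indices.split(";", -1)) {
            String[] parts = entry.split(",", -1);
            if (parts.length > 1 && !parts[1].trim().isEmpty()) {
                // The advance is given in hundredths of an em.
                advances.add(Double.parseDouble(parts[1].trim()) / 100.0);
            } else {
                advances.add(Double.NaN);
            }
        }
        return advances;
    }

    // Estimated width of the run in page units. Glyphs without an advance
    // get the average of the known advances, or the default if none are known.
    static double runWidth(String indices, double emSize) {
        List<Double> advances = parseAdvances(indices);
        double sum = 0.0;
        int known = 0;
        for (double advance : advances) {
            if (!Double.isNaN(advance)) {
                sum += advance;
                known++;
            }
        }
        double fallback = known > 0 ? sum / known : DEFAULT_ADVANCE_EM;
        double totalEm = sum + (advances.size() - known) * fallback;
        return totalEm * emSize;
    }

    // True if the gap between the end of this run and the start of the next
    // run exceeds thresholdEm (an assumed tuning value, expressed in em).
    static boolean needsSpace(double originX, String indices, double emSize,
                              double nextOriginX, double thresholdEm) {
        double runEnd = originX + runWidth(indices, emSize);
        return nextOriginX - runEnd > thresholdEm * emSize;
    }
}
```

For example, `GlyphRunSpacing.needsSpace(0.0, "20,50;30,60;40", 10.0, 30.0, 0.25)` compares the gap after a three-glyph run, whose last glyph has no advance, against a quarter of an em. The threshold value here is only a placeholder and would need tuning against real files.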