[ 
https://issues.apache.org/jira/browse/TIKA-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892409#comment-17892409
 ] 

ASF GitHub Bot commented on TIKA-4315:
--------------------------------------

ruwi-next commented on PR #1970:
URL: https://github.com/apache/tika/pull/1970#issuecomment-2434755713

   I've had a go at implementing this. The changes are in 
[c42466f](https://github.com/apache/tika/pull/1970/commits/c42466ff72f1470935ee01d792ef4d10d8c67f87).
I can squash them into one commit if that is preferred.
   
   The implementation uses the indices string that XPS provides for each run. 
It lists per-glyph information for the run. The useful part is the advance, 
which according to 
https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index
is measured in 1/100 em. Using this we can calculate the distance between runs 
and decide, based on a threshold, whether whitespace should be inserted. I have 
added some test XPS files that I made to test this.
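
   As a rough illustration of the gap check described above (a sketch, not the 
code in the PR): it assumes each semicolon-separated entry of the indices 
string has the form `glyphIndex,advance,...` with optional fields, counts 
omitted advances as zero, and uses a made-up threshold of 0.25 em. The class, 
method, and parameter names are all hypothetical.

```java
final class RunGapSketch {

    // Hypothetical threshold: a gap wider than 0.25 em counts as whitespace.
    static final double GAP_THRESHOLD_EM = 0.25;

    /** Sums the advances found in an indices string and converts 1/100 em to em. */
    static double sumAdvancesEm(String indices) {
        double hundredths = 0;
        for (String entry : indices.split(";")) {
            String[] parts = entry.split(",");
            // parts[0] is the glyph index, parts[1] the advance (may be omitted).
            if (parts.length > 1 && !parts[1].trim().isEmpty()) {
                hundredths += Double.parseDouble(parts[1].trim());
            }
        }
        return hundredths / 100.0;
    }

    /** Decides whether the gap between the end of one run and the start of the next is wide enough. */
    static boolean needsSpace(double prevOriginX, String prevIndices,
                              double fontSize, double nextOriginX) {
        double prevEndX = prevOriginX + sumAdvancesEm(prevIndices) * fontSize;
        double gapEm = (nextOriginX - prevEndX) / fontSize;
        return gapEm > GAP_THRESHOLD_EM;
    }
}
```

   For example, `RunGapSketch.needsSpace(0, "20,50;30,60;40", 10, 20)` sums 
the two known advances to 1.1 em, places the end of the run at x = 11, and 
compares the remaining 0.9 em gap against the threshold.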
   
   This implementation has some assumptions and limitations, mainly that we do 
not get the glyph advance value for the last glyph in a run. In this case I 
have fallen back to the average advance, or to 0.5.
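
   One way to read that fallback, as a sketch only: estimate the missing last 
advance as the mean of the advances that are known for the run, and use 0.5 
when none are known. Treating the 0.5 as a value in em, and the names below, 
are assumptions made for illustration.

```java
final class LastAdvanceSketch {

    // Assumption: the 0.5 fallback is expressed in em.
    static final double DEFAULT_ADVANCE_EM = 0.5;

    /**
     * Estimates the advance of the last glyph in em from the advances
     * (in 1/100 em) that were present for the other glyphs in the run.
     */
    static double estimateLastAdvanceEm(double[] knownAdvancesHundredths) {
        if (knownAdvancesHundredths.length == 0) {
            return DEFAULT_ADVANCE_EM;
        }
        double sum = 0;
        for (double advance : knownAdvancesHundredths) {
            sum += advance;
        }
        // Mean of the known advances, converted from 1/100 em to em.
        return sum / knownAdvancesHundredths.length / 100.0;
    }
}
```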
   
   It also sorts the runs in LTR order unless all runs in a row are RTL. This 
may be incorrect for rows that contain LTR runs together with multiple RTL 
runs, but I am not knowledgeable in this area.
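
   The ordering rule can be sketched roughly as follows. This is not the PR 
code: the `Run` type is hypothetical, and ordering an all-RTL row by 
descending x position is an assumption about what "sorted as RTL" means.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RowOrderSketch {

    /** Hypothetical minimal run: its text, x origin, and direction. */
    static final class Run {
        final String text;
        final double originX;
        final boolean rtl;

        Run(String text, double originX, boolean rtl) {
            this.text = text;
            this.originX = originX;
            this.rtl = rtl;
        }
    }

    /** Sorts a row left to right, unless every run in the row is RTL. */
    static List<Run> orderRow(List<Run> row) {
        boolean allRtl = !row.isEmpty() && row.stream().allMatch(r -> r.rtl);
        Comparator<Run> byX = Comparator.comparingDouble(r -> r.originX);
        List<Run> sorted = new ArrayList<>(row);
        // Assumption: an all-RTL row reads from the largest x to the smallest.
        sorted.sort(allRtl ? byX.reversed() : byX);
        return sorted;
    }
}
```

   A row that mixes LTR and RTL runs falls through to plain left-to-right 
ordering here, which is exactly the case flagged above as possibly incorrect.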
   
   Any feedback is appreciated :)




> XPS file parser does not emit whitespace as expected
> ----------------------------------------------------
>
>                 Key: TIKA-4315
>                 URL: https://issues.apache.org/jira/browse/TIKA-4315
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1, 2.9.2
>            Reporter: Ruairidh Williamson
>            Priority: Major
>         Attachments: testXLSX.xps
>
>
> We are using Tika to extract text from XPS files and have hit an issue where 
> whitespace is not emitted where we would expect. In the attached example 
> file there is visually a large gap between "x" and "abcde1234f", but when 
> Tika extracts the text it calls `characters` with "x" and then `characters` 
> with "abcde1234f". We would expect an `ignorableWhitespace` call in between 
> those calls, but we don't get one.
> I have a pull request that fixes the issue which I will submit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
