ruwi-next commented on PR #1970:
URL: https://github.com/apache/tika/pull/1970#issuecomment-2434755713

   I've had a go at implementing this; the changes are in 
[c42466f](https://github.com/apache/tika/pull/1970/commits/c42466ff72f1470935ee01d792ef4d10d8c67f87).
 I can squash them into one commit if that is preferred.
   
   The implementation uses the Indices string that XPS provides. It is a 
list of per-glyph information for each run. The useful part is the 
advance, which according to 
https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index
 is measured in 1/100 em. Using this we can calculate the distance between runs 
and decide, based on a threshold, whether whitespace should be inserted. I have 
added some test XPS files that I made to test this.
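
As a rough illustration of the idea (a sketch only, not the code in the commit), the gap check could look like the following. The class and method names, the `thresholdEm` parameter, and the simplified reading of each entry as `glyphIndex,advance` separated by `;` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class RunGapSketch {

    // Advance (second comma-separated field) of each ';'-separated glyph entry,
    // in 1/100 em. Entries without an explicit advance yield NaN.
    public static List<Double> parseAdvances(String indices) {
        List<Double> advances = new ArrayList<>();
        for (String entry : indices.split(";", -1)) {
            String[] fields = entry.split(",", -1);
            if (fields.length > 1 && !fields[1].trim().isEmpty()) {
                advances.add(Double.parseDouble(fields[1].trim()));
            } else {
                advances.add(Double.NaN);
            }
        }
        return advances;
    }

    // Width of a run in page units: known advances summed, then scaled by the em size.
    public static double runWidth(List<Double> advances, double emSize) {
        double hundredths = 0;
        for (double a : advances) {
            if (!Double.isNaN(a)) {
                hundredths += a;
            }
        }
        return hundredths / 100.0 * emSize;
    }

    // True when the gap between the end of the previous run and the start of
    // the next one is larger than thresholdEm (expressed in ems).
    public static boolean needsSpace(double prevOriginX, double prevWidth,
                                     double nextOriginX, double emSize,
                                     double thresholdEm) {
        double gap = nextOriginX - (prevOriginX + prevWidth);
        return gap > thresholdEm * emSize;
    }
}
```

For example, with an em size of 10, advances of 50 and 60 give a run width of about 11 page units; if the next run starts at x = 14 and the threshold is 0.2 em, the gap of roughly 3 units exceeds 2 and a space would be inserted.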
   
   This implementation makes some assumptions and has some limitations. The 
main one is that we do not get a glyph advance value for the last glyph in a 
run. In this case I fall back to the average advance, or to 0.5.
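
A minimal sketch of that fallback, again not the code in the commit: it assumes missing advances are represented as NaN, and it reads 0.5 as half an em (50 in 1/100 em units), which is an interpretation made for this example. The class, method, and constant names are hypothetical.

```java
import java.util.List;

public class LastGlyphAdvanceSketch {

    // Assumed default when no advance in the run is known:
    // 0.5 em, i.e. 50 in 1/100 em units.
    static final double DEFAULT_ADVANCE = 50.0;

    // Estimate for a glyph with no advance value: the mean of the known
    // advances in the run, or the default when none are known.
    public static double estimateMissingAdvance(List<Double> advances) {
        double sum = 0;
        int known = 0;
        for (double a : advances) {
            if (!Double.isNaN(a)) {
                sum += a;
                known++;
            }
        }
        return known > 0 ? sum / known : DEFAULT_ADVANCE;
    }
}
```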
   
   It also sorts the runs in LTR order unless all runs in a row are RTL. This 
may be incorrect for rows that mix LTR runs with multiple RTL runs, but I 
am not knowledgeable in this area.
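
The ordering rule described above can be sketched roughly as follows. The `Run` class and its fields are hypothetical, and treating "RTL order" as descending x origin is an assumption of this example rather than a statement about the commit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RunOrderSketch {

    // Hypothetical holder for one run in a row.
    public static class Run {
        final double originX;
        final boolean rtl;
        final String text;

        public Run(double originX, boolean rtl, String text) {
            this.originX = originX;
            this.rtl = rtl;
            this.text = text;
        }
    }

    // Sorts a row left to right by x origin, unless every run in the row
    // is RTL, in which case the row is sorted right to left.
    public static List<Run> orderRow(List<Run> row) {
        boolean allRtl = !row.isEmpty() && row.stream().allMatch(r -> r.rtl);
        Comparator<Run> byX = Comparator.comparingDouble(r -> r.originX);
        List<Run> sorted = new ArrayList<>(row);
        sorted.sort(allRtl ? byX.reversed() : byX);
        return sorted;
    }
}
```

A row with a single RTL run between LTR runs is still sorted left to right here, which is exactly the mixed case that may need a better rule.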
   
   Any feedback is appreciated :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
