ruwi-next commented on PR #1970: URL: https://github.com/apache/tika/pull/1970#issuecomment-2434755713
I've had a go at implementing this; the changes are in [c42466f](https://github.com/apache/tika/pull/1970/commits/c42466ff72f1470935ee01d792ef4d10d8c67f87). I can squash into one commit if that is preferred.

The implementation uses the indices string that XPS provides, which lists information for each glyph in a run. The useful part is the advance, which according to https://learn.microsoft.com/en-us/windows/win32/api/xpsobjectmodel/ns-xpsobjectmodel-xps_glyph_index is measured in 1/100 em. Using the advances, we can calculate the distance between runs and decide, based on a threshold, whether a whitespace should be inserted. I have added some XPS files that I made to test this.

This implementation has some assumptions and limitations. The main one is that we do not get the glyph advance value for the last glyph in a run; in that case I fall back to the average advance of the run, or to 0.5. It also sorts the runs as LTR unless all runs in a row are RTL. This may be incorrect for rows that contain an LTR run together with multiple RTL runs, but I am not knowledgeable in this area.

Any feedback is appreciated :)
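To make the gap check concrete, here is a minimal sketch of the idea, not the code from the commit. All names (`RunGapSketch`, `parseAdvancesEm`, `runWidth`, `needsSpace`) and the threshold value are hypothetical. It assumes the indices string has one `;`-separated entry per glyph, that the advance is the second `,`-separated field of an entry, and that the 0.5 fallback is in em.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the implementation in the PR.
final class RunGapSketch {
    // Assumed fallback advance (in em) when a run has no known advances at all.
    static final double DEFAULT_ADVANCE_EM = 0.5;

    // Parses per-glyph advances from a string such as "43,60;72,55;79".
    // Advances are given in 1/100 em; entries without an advance become NaN.
    static List<Double> parseAdvancesEm(String indices) {
        List<Double> advances = new ArrayList<>();
        if (indices == null || indices.isEmpty()) {
            return advances;
        }
        for (String glyph : indices.split(";", -1)) {
            String[] parts = glyph.split(",", -1);
            // parts[0] holds the glyph index; parts[1], if present, the advance.
            if (parts.length > 1 && !parts[1].trim().isEmpty()) {
                advances.add(Double.parseDouble(parts[1].trim()) / 100.0);
            } else {
                advances.add(Double.NaN);
            }
        }
        return advances;
    }

    // Estimated run width in page units: sum of advances (em) times font size.
    // Missing advances use the average of the known ones, else the default.
    static double runWidth(String indices, double fontSize) {
        List<Double> advances = parseAdvancesEm(indices);
        double knownSum = 0;
        int knownCount = 0;
        for (double a : advances) {
            if (!Double.isNaN(a)) {
                knownSum += a;
                knownCount++;
            }
        }
        double fallback = knownCount > 0 ? knownSum / knownCount : DEFAULT_ADVANCE_EM;
        double totalEm = 0;
        for (double a : advances) {
            totalEm += Double.isNaN(a) ? fallback : a;
        }
        return totalEm * fontSize;
    }

    // True if the gap between the end of the previous run and the start of
    // the next one exceeds thresholdEm (an arbitrary tuning value, in em).
    static boolean needsSpace(double prevOriginX, String prevIndices,
                              double fontSize, double nextOriginX,
                              double thresholdEm) {
        double prevEnd = prevOriginX + runWidth(prevIndices, fontSize);
        return nextOriginX - prevEnd > thresholdEm * fontSize;
    }
}
```

For example, with a font size of 10, a previous run at x = 0 with indices `"43,60;72,55;79"`, and a threshold of 0.2 em, a next run starting at x = 20 would get a space, while one starting at x = 18 would not. This covers LTR runs on a single row only; RTL handling and row grouping are left out.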