[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893490#comment-17893490 ]

Ruairidh Williamson commented on TIKA-4337:
-------------------------------------------

[~tallison] I've fixed the two exceptions in that pull request. I will look at the content differences and fix those in the same pull request.

From an initial look, some of our assumptions are broken by these files:

* We assume a glyph run contains contiguous text. One file uses a large glyph advance to jump to the next column, and then a separate glyph run fills in a column in between. This means we should probably calculate each glyph's position in the row, sort by position, and then find the large gaps to emit whitespace.
* Canvases can be a child of a VisualBrush, which can have a transform that moves the canvas. Currently we assume that canvases with matching clip strings are part of the same canvas, but this does not always hold: the parent transform can place them far apart.

> Improvements to recent xps mods
> -------------------------------
>
>                 Key: TIKA-4337
>                 URL: https://issues.apache.org/jira/browse/TIKA-4337
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions: one NPE, and a few number format exceptions where a comma-delimited string is parsed as if it were an integer.
> Reports are attached. See esp. new_exceptions_in_b_details.xlsx and content_diffs_no_exceptions.xlsx.
> The source files are available here:
> https://corpora.tika.apache.org/base/share/xps.tgz



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
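
For illustration only, the approach proposed in the first bullet of the comment (calculate each glyph's position in the row, sort, then emit whitespace at the large gaps) could be sketched roughly as below. This is not the code in the pull request or in Tika's XPS parser: the class `GlyphRowSketch`, the holder `PositionedGlyph`, the helpers `positions` and `rowText`, and the fixed gap threshold are all hypothetical names and assumptions, and the units are arbitrary.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlyphRowSketch {

    /** Hypothetical holder for one glyph and the x coordinate of its origin. */
    static final class PositionedGlyph {
        final double x;
        final char ch;

        PositionedGlyph(double x, char ch) {
            this.x = x;
            this.ch = ch;
        }
    }

    /**
     * Expands one glyph run into positioned glyphs. Assumption: advances[i] is
     * the distance from the origin of glyph i to the origin of glyph i + 1.
     */
    static List<PositionedGlyph> positions(double originX, String text, double[] advances) {
        List<PositionedGlyph> out = new ArrayList<>();
        double x = originX;
        for (int i = 0; i < text.length(); i++) {
            out.add(new PositionedGlyph(x, text.charAt(i)));
            x += advances[i];
        }
        return out;
    }

    /**
     * Sorts all glyphs of a row by x and inserts a space wherever two
     * neighbouring origins are more than gapThreshold apart.
     */
    static String rowText(List<PositionedGlyph> glyphs, double gapThreshold) {
        List<PositionedGlyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((PositionedGlyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0 && sorted.get(i).x - sorted.get(i - 1).x > gapThreshold) {
                sb.append(' ');
            }
            sb.append(sorted.get(i).ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<PositionedGlyph> row = new ArrayList<>();
        // Run 1 writes column one, then uses a large advance to reach column three.
        row.addAll(positions(0, "ABEF", new double[] {10, 200, 10, 10}));
        // Run 2 is a separate glyph run that fills in column two.
        row.addAll(positions(100, "CD", new double[] {10, 10}));
        System.out.println(rowText(row, 30)); // prints "AB CD EF"
    }
}
```

Emitting the runs in document order would give "ABEFCD" here, which is the kind of content difference described in the comment. A real implementation would presumably derive the threshold from the font size rather than use a constant, and would need to handle glyphs that map to more than one `char`.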