[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893490#comment-17893490
 ] 

Ruairidh Williamson commented on TIKA-4337:
-------------------------------------------

[~tallison] I've fixed the two exceptions in that pull request. I will look at 
the content differences and fix those in the same pull request.

From an initial look, some assumptions in the parser are being broken by these 
files:
 * A glyph run is assumed to contain contiguous text. In one file, a run uses a 
large glyph advance to jump to the next column, and a separate glyph run then 
inserts a column in between. This means we should probably calculate each 
glyph's position in the row, sort by position, and then emit whitespace 
wherever there is a large gap.
 * Canvases can be a child of a VisualBrush, which can have a transform that 
moves the canvas. Currently we assume that canvases with matching clip strings 
are part of the same canvas, but this is not true: such canvases may actually 
be far apart because of the parent transformation.
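
A rough sketch of the first idea (place each glyph, sort by position, emit 
whitespace at large gaps) might look like the following. This is not the code 
in the pull request: GlyphRowSketch, PlacedGlyph, place, rowToText, the single 
nominal glyph width and the gap threshold are all hypothetical names and 
simplifications chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names come from the Tika XPS parser.
class GlyphRowSketch {

    /** One glyph with its absolute x origin on the row and a nominal width. */
    static final class PlacedGlyph {
        final double x;
        final double width;
        final String text;

        PlacedGlyph(double x, double width, String text) {
            this.x = x;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Expands a glyph run into individually placed glyphs. The advances only
     * move the pen; the nominal width (assumed to come from font metrics) is
     * kept separately so an oversized advance still shows up as a gap.
     */
    static List<PlacedGlyph> place(double originX, String text,
                                   double[] advances, double nominalWidth) {
        List<PlacedGlyph> out = new ArrayList<>();
        double x = originX;
        for (int i = 0; i < text.length(); i++) {
            out.add(new PlacedGlyph(x, nominalWidth, String.valueOf(text.charAt(i))));
            x += advances[i];
        }
        return out;
    }

    /** Sorts all glyphs of a row by x and inserts a space at each large gap. */
    static String rowToText(List<PlacedGlyph> glyphs, double gapThreshold) {
        List<PlacedGlyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        PlacedGlyph prev = null;
        for (PlacedGlyph g : sorted) {
            // Gap between the end of the previous glyph and the start of this one.
            if (prev != null && g.x - (prev.x + prev.width) > gapThreshold) {
                sb.append(' ');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

For example, placing a run "abef" at origin 0 with advances {10, 190, 10, 10} 
and a second run "cd" at origin 100 with advances {10, 10}, both with a nominal 
width of 10, and calling rowToText on the combined list with a threshold of 20 
gives "ab cd ef" rather than "abefcd". Glyphs under a transformed parent would 
need their positions mapped into a common coordinate space first, which this 
sketch does not attempt.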

> Improvements to recent xps mods
> -------------------------------
>
>                 Key: TIKA-4337
>                 URL: https://issues.apache.org/jira/browse/TIKA-4337
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
