[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894603#comment-17894603
 ] 

Ruairidh Williamson commented on TIKA-4337:
-------------------------------------------

The structural metadata is interesting and could be useful, especially for 
complicated cases with overlapping canvases. Unfortunately, it doesn't appear 
to be present in all the documents, so it would be an opportunistic improvement.

I've fixed up some of the text extraction issues like I said I would and I 
think my pull request is ready.

The only thing I haven't done is add tests for these cases. What are the 
licensing terms for including the documents found in the crawl?

There were two documents in particular I was testing against that I think would 
be helpful to add to the test data. They demonstrate the glyph run splitting 
and VisualBrush issues: 
3e596fd4a6b1b5952333a33826c7b4763b255b96a9bd398467a7fb4c60c50f5e, 
db4284d3960c4ac5e64f6c67a1d15e37e22a9faeac95c7e051c5cf5b2cc4e385

I am not sure how to generate the content diff report, so I can't say how much 
improvement my changes would make.

> Improvements to recent xps mods
> -------------------------------
>
>                 Key: TIKA-4337
>                 URL: https://issues.apache.org/jira/browse/TIKA-4337
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
