Tim Allison created TIKA-4337:
---------------------------------

             Summary: Improvements to recent xps mods
                 Key: TIKA-4337
                 URL: https://issues.apache.org/jira/browse/TIKA-4337
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison
         Attachments: xps-reports.tgz

I pulled 249 XPS files out of the latest Common Crawl crawl and compared 
3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, including one NPE 
and a few NumberFormatExceptions where a comma-delimited string is parsed as 
if it were a single integer.
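
To illustrate the NumberFormatException case, here is a minimal sketch, not 
the actual Tika code. The class name, the parseInts helper, and the sample 
value "12,34" are all hypothetical; I have not checked which XPS attribute 
triggers the problem. It only shows that Integer.parseInt rejects a 
comma-delimited string, and one possible way to split the string first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not code from the Tika XPS parser.
class CommaDelimitedInts {

    // Splits a comma-delimited string and parses each piece as an int.
    // Pieces that are empty or not valid integers are skipped.
    static List<Integer> parseInts(String raw) {
        List<Integer> values = new ArrayList<>();
        if (raw == null) {
            return values;
        }
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                values.add(Integer.parseInt(trimmed));
            } catch (NumberFormatException e) {
                // Assumption: skipping bad pieces is acceptable here;
                // a real parser might prefer to log or record a warning.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "12,34"; // assumed sample value

        // Parsing the whole string as one integer throws.
        try {
            Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            System.out.println("Integer.parseInt failed: " + e.getMessage());
        }

        // Splitting first yields the individual values.
        System.out.println(parseInts(raw));
    }
}
```

Whether skipping an unparseable piece is the right behavior, as opposed to 
failing the whole value, depends on what the attribute means in the XPS 
markup; the sketch does not settle that.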

Reports are attached. See especially new_exceptions_in_b_details.xlsx and 
content_diffs_no_exceptions.xlsx.

The source files are available here: 
https://corpora.tika.apache.org/base/share/xps.tgz





--
This message was sent by Atlassian Jira
(v8.20.10#820010)
