[ https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894629#comment-17894629 ]
Tim Allison commented on TIKA-4337:
-----------------------------------

Yes, I completely agree about the "opportunistic improvement." I think this could be an area for future work, but it is not broadly applicable.

The licenses for those files are definitely not Apache 2.0 compliant, so we can't include them directly in our unit tests. :(

However, I could put them in our regression corpus, and we'd see changes whenever we run large-scale regression testing before a release. This is not ideal, but it is the best we can do.

Do any fellow devs ([~tilman] [~nick] ?) know if we could try to download the files as part of the build process and then incorporate local copies into unit tests? I know PDFBox downloads some files for unit tests, but I don't know what their licensing is... Or does this go against the spirit of the Apache license?

> Improvements to recent xps mods
> -------------------------------
>
>                 Key: TIKA-4337
>                 URL: https://issues.apache.org/jira/browse/TIKA-4337
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few
> number format exceptions where a comma-delimited string is parsed as if it
> were an integer.
> Reports are attached. See esp. new_exceptions_in_b_details.xlsx and
> content_diffs_no_exceptions.xlsx.
> The source files are available here:
> https://corpora.tika.apache.org/base/share/xps.tgz

--
This message was sent by Atlassian Jira (v8.20.10#820010)