[ 
https://issues.apache.org/jira/browse/TIKA-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894660#comment-17894660
 ] 

Tilman Hausherr edited comment on TIKA-4337 at 10/31/24 5:08 PM:
-----------------------------------------------------------------

If files are copyrighted, we download them during the build but don't keep 
them in the repository.
An example of this can be seen here: [ https://svn.apache.org/r1921706 ]
Files that are not copyrighted are kept in the source code repository. The 
best approach would be to look for very small XPS documents that have the 
features to be tested, until you find one that comes from a government.


was (Author: tilman):
We download during the build files if they are copyrighted, but don't keep 
these in the repository. We keep files in the source code repository that are 
not copyrighted. The best would be to look for very small xps documents that 
have the features that are to be tested, until hitting one that is from a 
government.

> Improvements to recent xps mods
> -------------------------------
>
>                 Key: TIKA-4337
>                 URL: https://issues.apache.org/jira/browse/TIKA-4337
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: xps-reports.tgz
>
>
> I pulled 249 xps files out of the latest commoncrawl crawl and compared 
> 3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few 
> number format exceptions where a comma-delimited string is parsed as if it 
> were an integer.
> Reports are attached.  See esp. new_exceptions_in_b_details.xlsx and 
> content_diffs_no_exceptions.xlsx.
> The source files are available here: 
> https://corpora.tika.apache.org/base/share/xps.tgz



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
