[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047119#comment-14047119 ]

Maruan Sahyoun commented on TIKA-1300:
--------------------------------------

From their website:

—
Please note that the files in this corpus are verbatim copies of files 
downloaded from USG webservers. We are aware that some of these files contain 
malware in the form of JavaScript exploits and Windows malware that was sent to 
mailing lists (that are now present in the mailing list archives). Although 
this may trigger some anti-virus programs, the malware will not be removed from 
the files because it is legitimately part of the corpus.

A malware scan of the govdocs1 directory is now available from 
http://digitalcorpora.org/corp/nps/files/govdocs1/MetascanClientLog_201306281214.txt .
—

So the link mentioned above has the results of the malware scan.


> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
