[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006015#comment-14006015 ]
Giuseppe Totaro commented on TIKA-1302:
---------------------------------------

Hi Tim,
I am referring to the metadata schema of each govdocs1 file. At http://digitalcorpora.org/corpora/files, you can read:
{quote}
The following metadata is provided for each of the files:
The URL from which the file was downloaded.
The date and time of the download.
The search term that was used.
The search engine that provided the document.
The length and SHA1 of the file.
A Simple Dublin Core for the file.
{quote}
Of course, once our paper is published, I'll try to explain our work and dataset in more detail.

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)

--
This message was sent by Atlassian JIRA
(v6.2#6252)