What Nick said...

cc_large is a sample of some of the larger documents from
commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the
MockParser that is available in the tika-core tests jar.  It lets you
trigger OOMs, timeouts, system exits and all sorts of other mayhem on
demand.  Obviously, don't put the tika-core tests jar on your classpath
in production.  See the files in
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
for examples of how to trigger bad behavior with the MockParser.
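
As a rough sketch of what those mock files look like, here is a small
Python script (stdlib only) that writes one file per failure mode into a
directory you can then feed to your pipeline.  The element and attribute
names (mock, throw, hang, oom, system_exit) are written from memory of the
example files and are assumptions, so please check them against the
directory linked above.  SCENARIOS, build_mock_xml and write_mock_files are
names made up for this illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# ASSUMPTION: the tag and attribute names below are recalled from the
# example files in the mock test-documents directory. Verify them against
# that directory before relying on them.
SCENARIOS = {
    "throw_npe": [
        ("throw", {"class": "java.lang.NullPointerException"}, "simulated failure"),
    ],
    "hang_30s": [
        ("hang", {"millis": "30000", "heavy": "false", "interruptible": "false"}, None),
    ],
    "oom": [("oom", {}, None)],
    "system_exit": [("system_exit", {}, None)],
}


def build_mock_xml(steps):
    """Return a <mock> document whose children are the given steps, in order."""
    root = ET.Element("mock")
    for tag, attrs, text in steps:
        element = ET.SubElement(root, tag, attrs)
        if text is not None:
            element.text = text
    return ET.tostring(root, encoding="unicode")


def write_mock_files(out_dir):
    """Write one .xml file per scenario into out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, steps in SCENARIOS.items():
        path = out_dir / f"{name}.xml"
        path.write_text(build_mock_xml(steps), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Write the files to a fresh temp directory and list them.
    for written in write_mock_files(tempfile.mkdtemp(prefix="tika-mock-")):
        print(written)
```

Keeping each failure mode in its own file makes it easy to see which one
your pipeline does not survive.  The files only do anything if the
MockParser is on the classpath of the Tika instance under test.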

On the corpora, as Nick said, let us know what you want and we can help you
select files.

Cheers,

        Tim


On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:

> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
>
