What Nick said... cc_large is a sample of some of the larger documents from commoncrawl3_refetched.
If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar. That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem. Obv, don't put the tika-core tests jar on your class path in production.

See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.

On the corpora, as Nick said, let us know what you want and we can help you select files.

Cheers,

Tim

On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:

> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
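
As a rough illustration of the MockParser suggestion above, here is a minimal Python sketch that writes a handful of mock trigger files to a directory you can feed to your pipeline. The element and attribute names (`throw`, `hang`, `oom`, `system_exit`) are recalled from the example files in the mock directory linked above and should be treated as assumptions; check them against those files before relying on them. The function name, file names, and output directory are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumed MockParser instructions, one per file. Verify each element and
# attribute against the example files in Tika's test-documents/mock directory.
MOCK_BODIES = {
    "throw_npe.xml": '<throw class="java.lang.NullPointerException">mock NPE</throw>',
    "hang_30s.xml": '<hang millis="30000" heavy="false" interruptible="false" />',
    "oom.xml": "<oom/>",
    "system_exit.xml": "<system_exit/>",
}


def write_mock_files(out_dir):
    """Write one mock XML file per instruction and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, body in MOCK_BODIES.items():
        text = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f"<mock>\n  {body}\n</mock>\n"
        )
        # Fail early if a body is not well-formed XML.
        ET.fromstring(text.encode("utf-8"))
        path = out / name
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    for p in write_mock_files("mock-inputs"):
        print(p)
```

Running the script writes the files into `mock-inputs/`. These only trigger the bad behavior when the MockParser from the tika-core tests jar is on the class path, so keep that to a test environment.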