We have ~1.9TB. But I'd skip cc_large because that's just a copy of some directories under commoncrawl3_refetched.
If you want to pull fresher data out of CommonCrawl, I have undocumented scripts to do that. I could add documentation.

These are the top 100 mime types and counts. This db was generated on a slightly earlier version of the corpus/corpora, but it should be close enough.

MIME_STRING  cnt
application/pdf  768490
text/plain  472041
text/html  429707
application/x-tika-msoffice  297990
image/png  190815
application/octet-stream  190645
image/jpeg  179533
application/xhtml+xml  151830
application/x-bzip2  124204
application/x-tika-ooxml  122523
application/x-bzip  107435
application/xml  107003
application/zip  93467
application/x-sh  88712
application/gzip  73535
image/gif  66713
application/zlib  46483
text/calendar  40385
application/postscript  35526
application/rss+xml  34428
application/atom+xml  28950
multipart/appledouble  27602
image/svg+xml  25771
application/vnd.oasis.opendocument.text  25753
application/rdf+xml  24890
application/vnd.google-earth.kml+xml  24049
application/rtf  23915
application/x-matroska  19437
application/x-shockwave-flash  18879
video/quicktime  18546
application/epub+zip  18205
application/vnd.ms-excel  17465
application/x-xz  16869
text/x-vcard  16772
application/java-vm  16761
audio/mpeg  15534
message/rfc822  14405
application/vnd.oasis.opendocument.spreadsheet  12659
application/x-bibtex-text-file  12261
application/x-rar-compressed; version=4  12123
text/x-php  10870
text/x-diff  10080
video/mp4  8281
audio/mp4  8221
application/x-msdownload  8019
application/x-bittorrent  7964
image/vnd.microsoft.icon  7382
application/mbox  6799
application/x-x509-cert; format=der  6597
audio/vnd.wave  6550
image/bmp  6411
application/x-endnote-refer  5922
image/vnd.djvu  5874
text/x-matlab  5734
application/vnd.apple.mpegurl  5511
image/tiff  5430
image/webp  4972
application/vnd.oasis.opendocument.presentation  3989
text/x-jsp  3973
text/x-csrc  3555
video/x-ms-wmv  3453
video/x-m4v  3443
application/x-dbf  3381
text/x-chdr  3263
text/x-perl  3124
application/x-rpm  3023
application/x-mobipocket-ebook  2726
audio/midi  2697
application/vnd.oasis.opendocument.graphics  2675
application/vnd.ms-excel.sheet.4  2591
application/x-font-ttf  2575
application/xspf+xml  2557
text/x-python  2416
audio/vorbis  2354
application/msword  2223
application/ogg  2222
application/x-gtar  2181
audio/x-mpegurl  2067
video/x-flv  1969
audio/x-ms-wma  1874
image/icns  1857
application/x-object  1823
application/x-7z-compressed  1795
application/x-msdownload; format=pe32  1784
application/x-debian-package  1700
application/x-mysql-table-definition  1669
image/vnd.dxf; format=ascii  1664
application/x-sqlite3  1606
application/x-berkeley-db; format=hash  1457
application/x-executable  1455
video/mpeg  1366
application/pkcs7-signature  1359
application/x-ms-asx  1266
image/vnd.zbrush.pcx  1247
image/vnd.dwg  1243
application/fits  1217
application/xslfo+xml  1206
application/x-sharedlib  1185
audio/prs.sid  1173
text/x-vcalendar  1156

On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <oscar.rieke...@cofense.com> wrote:

> We were thinking something around 2TB of data with a good mix of excel,
> images, pdfs, text and powerpoints. So I guess a mix of everything.
>
> *From: *Tim Allison <talli...@apache.org>
> *Date: *Tuesday, July 26, 2022 at 9:19 AM
> *To: *u...@tika.apache.org <u...@tika.apache.org>
> *Cc: *Oscar Rieken Jr <oscar.rieke...@cofense.com>,
> corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
> *Subject: *Re: Datasets for testing large number of attachments
>
> External Email
>
> What Nick said...
>
> cc_large is a sample of some of the larger documents from
> commoncrawl3_refetched.
>
> If you want to give your pipeline a workout, I also recommend using the
> MockParser that is available in the tika-core tests jar. That allows you
> to instrument an OOM and timeouts and system exits and all sorts of other
> mayhem. Obv, don't put the tika-core tests jar on your class path in
> production.
> See the files in
> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
> for examples of how to trigger bad behavior with the MockParser.
>
> On the corpora, as Nick said, let us know what you want and we can help
> you select files.
>
> Cheers,
>
> Tim
>
> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
>
> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
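
When deciding on a mix of file types like the one asked for in this thread, it may help to group the MIME counts from this message into rough buckets. The following is only a minimal Python sketch: the sample rows are copied from the counts in this message, but the function names, the bucket names and the matching rules are hypothetical and not part of Tika or the corpora tooling.

```python
from collections import Counter

# A few rows copied from the MIME counts in this message (not the full list).
SAMPLE = """\
MIME_STRING cnt
application/pdf 768490
text/plain 472041
text/html 429707
application/x-tika-msoffice 297990
image/png 190815
image/jpeg 179533
application/vnd.ms-excel 17465
application/vnd.oasis.opendocument.spreadsheet 12659
application/x-rar-compressed; version=4 12123
"""


def parse_counts(text):
    """Parse 'mime count' rows. The MIME string may itself contain spaces
    (e.g. '; version=4'), so split on the last space only."""
    counts = {}
    for line in text.splitlines():
        mime, _, cnt = line.strip().rpartition(" ")
        if not cnt.isdigit():
            continue  # skips the header row and blank lines
        counts[mime.strip()] = int(cnt)
    return counts


def bucket(mime):
    """Hypothetical grouping, chosen only for illustration."""
    if mime == "application/pdf":
        return "pdf"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("text/"):
        return "text"  # note: this also catches text/html
    if "excel" in mime or "spreadsheet" in mime:
        return "spreadsheet"
    # Generic container types such as application/x-tika-msoffice land here,
    # since the type alone does not say whether a file is Excel or PowerPoint.
    return "other"


def summarize(counts):
    """Sum file counts per bucket."""
    totals = Counter()
    for mime, cnt in counts.items():
        totals[bucket(mime)] += cnt
    return totals


if __name__ == "__main__":
    totals = summarize(parse_counts(SAMPLE))
    grand_total = sum(totals.values())
    for name, cnt in totals.most_common():
        print(f"{name:12s} {cnt:8d} {100 * cnt / grand_total:5.1f}%")
```

Pasting the full list into SAMPLE should work the same way. The counts are numbers of files, not bytes, so they say nothing directly about how close a selection gets to a size target such as 2TB.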