Script I used back in the day to do what you are looking for:

#!/bin/bash
for i in $(seq -f "%03g" 2 999)
do
  wget http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip -O $i.zip
  unzip $i.zip
  rm $i.zip
done
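A slightly more defensive sketch of the same loop, for anyone reusing it today: it generates the zipfile URLs first (so you can dry-run it), resumes partial downloads, and only unpacks an archive if the fetch succeeded. The `DO_FETCH` guard and `govdocs_urls` helper are my additions, not part of the original script.

```shell
#!/bin/bash
# Dry-run sketch of the govdocs1 fetch loop. By default it only prints
# URLs; set DO_FETCH=1 to actually download and unpack.
BASE_URL="http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles"

govdocs_urls() {
    # seq -f "%03g" zero-pads the counter to three digits: 002, 003, ... 999
    local first=$1 last=$2
    for i in $(seq -f "%03g" "$first" "$last"); do
        echo "$BASE_URL/$i.zip"
    done
}

if [ "${DO_FETCH:-0}" = "1" ]; then
    for url in $(govdocs_urls 2 999); do
        f=$(basename "$url")
        # -c resumes partial downloads; only unzip/rm if the fetch succeeded
        wget -c "$url" -O "$f" && unzip -o "$f" && rm "$f"
    done
else
    # Dry run: print a few URLs as a sanity check
    govdocs_urls 2 4
fi
```

Whether the digitalcorpora.org URL still resolves is exactly the open question in the original message, so check the dry-run output against the site before turning on `DO_FETCH`.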
not sure if it still works

On Tue, Jul 26, 2022 at 1:59 PM Tim Allison <talli...@apache.org> wrote:

> As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch
> of truncated files. We refetched some and put those under
> commoncrawl3_refetched.
>
> On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <talli...@apache.org> wrote:
>
>> We have ~1.9TB. But I'd skip cc_large because that's just a copy of some
>> directories under commoncrawl3_refetched.
>>
>> If you want to pull fresher data out of CommonCrawl, I have undocumented
>> scripts to do that. I could add documentation.
>>
>> These are the top 100 mime types and counts. This db was generated on a
>> slightly earlier version of the corpus/corpora, but it should be close
>> enough.
>>
>> MIME_STRING  cnt
>> application/pdf 768490
>> text/plain 472041
>> text/html 429707
>> application/x-tika-msoffice 297990
>> image/png 190815
>> application/octet-stream 190645
>> image/jpeg 179533
>> application/xhtml+xml 151830
>> application/x-bzip2 124204
>> application/x-tika-ooxml 122523
>> application/x-bzip 107435
>> application/xml 107003
>> application/zip 93467
>> application/x-sh 88712
>> application/gzip 73535
>> image/gif 66713
>> application/zlib 46483
>> text/calendar 40385
>> application/postscript 35526
>> application/rss+xml 34428
>> application/atom+xml 28950
>> multipart/appledouble 27602
>> image/svg+xml 25771
>> application/vnd.oasis.opendocument.text 25753
>> application/rdf+xml 24890
>> application/vnd.google-earth.kml+xml 24049
>> application/rtf 23915
>> application/x-matroska 19437
>> application/x-shockwave-flash 18879
>> video/quicktime 18546
>> application/epub+zip 18205
>> application/vnd.ms-excel 17465
>> application/x-xz 16869
>> text/x-vcard 16772
>> application/java-vm 16761
>> audio/mpeg 15534
>> message/rfc822 14405
>> application/vnd.oasis.opendocument.spreadsheet 12659
>> application/x-bibtex-text-file 12261
>> application/x-rar-compressed; version=4 12123
>> text/x-php 10870
>> text/x-diff 10080
>> video/mp4 8281
>> audio/mp4 8221
>> application/x-msdownload 8019
>> application/x-bittorrent 7964
>> image/vnd.microsoft.icon 7382
>> application/mbox 6799
>> application/x-x509-cert; format=der 6597
>> audio/vnd.wave 6550
>> image/bmp 6411
>> application/x-endnote-refer 5922
>> image/vnd.djvu 5874
>> text/x-matlab 5734
>> application/vnd.apple.mpegurl 5511
>> image/tiff 5430
>> image/webp 4972
>> application/vnd.oasis.opendocument.presentation 3989
>> text/x-jsp 3973
>> text/x-csrc 3555
>> video/x-ms-wmv 3453
>> video/x-m4v 3443
>> application/x-dbf 3381
>> text/x-chdr 3263
>> text/x-perl 3124
>> application/x-rpm 3023
>> application/x-mobipocket-ebook 2726
>> audio/midi 2697
>> application/vnd.oasis.opendocument.graphics 2675
>> application/vnd.ms-excel.sheet.4 2591
>> application/x-font-ttf 2575
>> application/xspf+xml 2557
>> text/x-python 2416
>> audio/vorbis 2354
>> application/msword 2223
>> application/ogg 2222
>> application/x-gtar 2181
>> audio/x-mpegurl 2067
>> video/x-flv 1969
>> audio/x-ms-wma 1874
>> image/icns 1857
>> application/x-object 1823
>> application/x-7z-compressed 1795
>> application/x-msdownload; format=pe32 1784
>> application/x-debian-package 1700
>> application/x-mysql-table-definition 1669
>> image/vnd.dxf; format=ascii 1664
>> application/x-sqlite3 1606
>> application/x-berkeley-db; format=hash 1457
>> application/x-executable 1455
>> video/mpeg 1366
>> application/pkcs7-signature 1359
>> application/x-ms-asx 1266
>> image/vnd.zbrush.pcx 1247
>> image/vnd.dwg 1243
>> application/fits 1217
>> application/xslfo+xml 1206
>> application/x-sharedlib 1185
>> audio/prs.sid 1173
>> text/x-vcalendar 1156
>>
>> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <
>> oscar.rieke...@cofense.com> wrote:
>>
>>> We were thinking something around 2TB of data with a good mix of excel,
>>> images, pdfs, text and powerpoints. So I guess a mix of everything.
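Given the warning above that Common Crawl truncates files at 1MB, a quick heuristic for flagging likely-truncated files in a local copy is to look for files that are exactly at the cutoff. This sketch assumes the limit is 1 MiB (1048576 bytes); a legitimate file could coincidentally be exactly that size, so treat hits as candidates, not certainties.

```shell
# Heuristic: files that are exactly 1 MiB were very likely cut off at
# Common Crawl's 1MB fetch limit. -size 1048576c matches exact byte size.
flag_truncated() {
    local dir=$1
    find "$dir" -type f -size 1048576c -print
}

# e.g. flag_truncated ./commoncrawl3 > maybe-truncated.txt
```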
>>>
>>> *From: *Tim Allison <talli...@apache.org>
>>> *Date: *Tuesday, July 26, 2022 at 9:19 AM
>>> *To: *u...@tika.apache.org <u...@tika.apache.org>
>>> *Cc: *Oscar Rieken Jr <oscar.rieke...@cofense.com>,
>>> corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
>>> *Subject: *Re: Datasets for testing large number of attachments
>>>
>>> External Email
>>>
>>> What Nick said...
>>>
>>> cc_large is a sample of some of the larger documents from
>>> commoncrawl3_refetched.
>>>
>>> If you want to give your pipeline a workout, I also recommend using the
>>> MockParser that is available in the tika-core tests jar. That allows you
>>> to instrument an OOM and timeouts and system exits and all sorts of other
>>> mayhem. Obv, don't put the tika-core tests jar on your class path in
>>> production. See the files in
>>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
>>> for examples of how to trigger bad behavior with the MockParser.
>>>
>>> On the corpora, as Nick said, let us know what you want and we can help
>>> you select files.
>>>
>>> Cheers,
>>>
>>> Tim
>>>
>>> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
>>>
>>> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
>>> > I am currently trying to validate our Tika setup and was looking for a
>>> > set of example data I could use
>>>
>>> If you want a small number of files of lots of different types, the test
>>> files in the Tika source tree will work. The main set is in
>>> tika-parsers/src/test/resources/test-documents/
>>>
>>> If you want a very large number of files, then the Tika Corpora
>>> collection is a good source. We have a few different collections,
>>> including stuff from common crawl, govdocs and bug trackers.
>>> If you can let us know what sort of file types and how many, we can
>>> suggest the best corpora collection.
>>>
>>> Nick
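Once a subset of any of these corpora is pulled down locally, a quick way to smoke-test a Tika pipeline is to run it over a random sample rather than the whole tree. A minimal sketch, assuming GNU coreutils' `shuf` is available; the corpus directory name and sample size are placeholders:

```shell
# Draw a random sample of n files from a corpus directory for a quick
# pipeline smoke test before committing to a full multi-TB run.
sample_files() {
    local dir=$1 n=$2
    find "$dir" -type f | shuf -n "$n"
}

# e.g. sample_files ./corpus 100 > sample-files.txt
```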