Re: Datasets for testing large number of attachments

Nicholas DiPiazza Tue, 26 Jul 2022 12:13:22 -0700

Script I used back in the day to do what you are looking for:

#!/bin/bash
# Download and unpack the govdocs1 zip files numbered 002 through 999.
for i in $(seq -f "%03g" 2 999)
do
  wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip"
  unzip "$i.zip"
  rm "$i.zip"
done


not sure if it still works
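
A slightly more defensive variant of the loop above, as a sketch only. The base URL is copied from the script and may no longer be live; the function names govdocs_url and fetch_govdocs are made up for illustration. It reports failures instead of leaving a partial zip behind and carrying on.

```shell
#!/bin/bash
# Assumption: same host and path as the original script; it may have moved.
BASE_URL="http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles"

# Print the URL of one zip file. Pass a plain decimal number (2, not 002);
# it is zero-padded to three digits here.
govdocs_url() {
  printf '%s/%03d.zip\n' "$BASE_URL" "$1"
}

# Download and unpack zips START..END. A zip is removed after a successful
# unpack; a failed download or unpack is reported on stderr and skipped.
fetch_govdocs() {
  local start=$1 end=$2 n url zip
  for n in $(seq "$start" "$end"); do
    url=$(govdocs_url "$n")
    zip=$(basename "$url")
    if wget -q "$url" -O "$zip" && unzip -q "$zip"; then
      rm "$zip"
    else
      echo "failed: $url" >&2
      rm -f "$zip"
    fi
  done
}
```

Calling fetch_govdocs 2 999 would cover the same range as the original loop.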


On Tue, Jul 26, 2022 at 1:59 PM Tim Allison <talli...@apache.org> wrote:

> As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch
> of truncated files.  We refetched some and put those under
> commoncrawl3_refetched.
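
One rough way to spot candidates for truncation in a local copy, as a sketch. It assumes the cutoff is exactly 1,048,576 bytes (the message only says 1MB), so the limit is a parameter; the function name list_possibly_truncated is hypothetical. A file of exactly that size is only a candidate, not proof of truncation.

```shell
#!/bin/bash
# Assumption: truncated files sit at exactly the cutoff, taken here as 1 MiB.
TRUNCATION_BYTES=1048576

# List regular files under DIR whose size in bytes equals the cutoff.
# Usage: list_possibly_truncated DIR [LIMIT_BYTES]
list_possibly_truncated() {
  local dir=$1 limit=${2:-$TRUNCATION_BYTES}
  # The "c" suffix makes find compare the size in bytes, without rounding.
  find "$dir" -type f -size "${limit}c" | sort
}
```

For example, list_possibly_truncated /path/to/corpus prints one path per line; the directory name is a placeholder.
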
>
> On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <talli...@apache.org> wrote:
>
>> We have ~1.9TB.  But I'd skip cc_large because that's just a copy of some
>> directories under commoncrawl3_refetched.
>>
>> If you want to pull fresher data out of CommonCrawl, I have undocumented
>> scripts to do that.  I could add documentation.
>>
>> These are the top 100 mime types and counts.  This db was generated on a
>> slightly earlier version of the corpus/corpora, but it should be close
>> enough.
>>
>> MIME_STRING    cnt
>> application/pdf    768490
>> text/plain    472041
>> text/html    429707
>> application/x-tika-msoffice    297990
>> image/png    190815
>> application/octet-stream    190645
>> image/jpeg    179533
>> application/xhtml+xml    151830
>> application/x-bzip2    124204
>> application/x-tika-ooxml    122523
>> application/x-bzip    107435
>> application/xml    107003
>> application/zip    93467
>> application/x-sh    88712
>> application/gzip    73535
>> image/gif    66713
>> application/zlib    46483
>> text/calendar    40385
>> application/postscript    35526
>> application/rss+xml    34428
>> application/atom+xml    28950
>> multipart/appledouble    27602
>> image/svg+xml    25771
>> application/vnd.oasis.opendocument.text    25753
>> application/rdf+xml    24890
>> application/vnd.google-earth.kml+xml    24049
>> application/rtf    23915
>> application/x-matroska    19437
>> application/x-shockwave-flash    18879
>> video/quicktime    18546
>> application/epub+zip    18205
>> application/vnd.ms-excel    17465
>> application/x-xz    16869
>> text/x-vcard    16772
>> application/java-vm    16761
>> audio/mpeg    15534
>> message/rfc822    14405
>> application/vnd.oasis.opendocument.spreadsheet    12659
>> application/x-bibtex-text-file    12261
>> application/x-rar-compressed; version=4    12123
>> text/x-php    10870
>> text/x-diff    10080
>> video/mp4    8281
>> audio/mp4    8221
>> application/x-msdownload    8019
>> application/x-bittorrent    7964
>> image/vnd.microsoft.icon    7382
>> application/mbox    6799
>> application/x-x509-cert; format=der    6597
>> audio/vnd.wave    6550
>> image/bmp    6411
>> application/x-endnote-refer    5922
>> image/vnd.djvu    5874
>> text/x-matlab    5734
>> application/vnd.apple.mpegurl    5511
>> image/tiff    5430
>> image/webp    4972
>> application/vnd.oasis.opendocument.presentation    3989
>> text/x-jsp    3973
>> text/x-csrc    3555
>> video/x-ms-wmv    3453
>> video/x-m4v    3443
>> application/x-dbf    3381
>> text/x-chdr    3263
>> text/x-perl    3124
>> application/x-rpm    3023
>> application/x-mobipocket-ebook    2726
>> audio/midi    2697
>> application/vnd.oasis.opendocument.graphics    2675
>> application/vnd.ms-excel.sheet.4    2591
>> application/x-font-ttf    2575
>> application/xspf+xml    2557
>> text/x-python    2416
>> audio/vorbis    2354
>> application/msword    2223
>> application/ogg    2222
>> application/x-gtar    2181
>> audio/x-mpegurl    2067
>> video/x-flv    1969
>> audio/x-ms-wma    1874
>> image/icns    1857
>> application/x-object    1823
>> application/x-7z-compressed    1795
>> application/x-msdownload; format=pe32    1784
>> application/x-debian-package    1700
>> application/x-mysql-table-definition    1669
>> image/vnd.dxf; format=ascii    1664
>> application/x-sqlite3    1606
>> application/x-berkeley-db; format=hash    1457
>> application/x-executable    1455
>> video/mpeg    1366
>> application/pkcs7-signature    1359
>> application/x-ms-asx    1266
>> image/vnd.zbrush.pcx    1247
>> image/vnd.dwg    1243
>> application/fits    1217
>> application/xslfo+xml    1206
>> application/x-sharedlib    1185
>> audio/prs.sid    1173
>> text/x-vcalendar    1156
>>
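
To get a coarser view of a listing like the one above (for example, how much is image versus text), the counts can be summed by top-level media type. This is a minimal sketch, not how the original numbers were produced; the function name sum_by_top_level_type is hypothetical. It reads the listing on stdin and tolerates leading quote markers.

```shell
#!/bin/bash
# Sum the counts of a "MIME_STRING cnt" listing by top-level media type.
# The count is taken to be the last field, because some MIME strings
# contain spaces (e.g. "application/x-rar-compressed; version=4").
sum_by_top_level_type() {
  awk '
    { sub(/^[> ]+/, "") }           # drop email quote markers, if any
    $NF ~ /^[0-9]+$/ {              # skips the header and blank lines
      split($1, parts, "/")
      total[parts[1]] += $NF
    }
    END {
      for (t in total) printf "%s\t%d\n", t, total[t]
    }
  ' | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Saving the table to a file and running sum_by_top_level_type < mime_counts.txt (a placeholder file name) prints one tab-separated line per top-level type, largest first. Since the table only covers the top 100 types, the sums describe that subset, not the whole corpus.
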
>>
>>
>>
>> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <
>> oscar.rieke...@cofense.com> wrote:
>>
>>> We were thinking something around 2TB of data with a good mix of excel,
>>> images, pdfs, text and powerpoints. So I guess a mix of everything.
>>>
>>>
>>>
>>> *From: *Tim Allison <talli...@apache.org>
>>> *Date: *Tuesday, July 26, 2022 at 9:19 AM
>>> *To: *u...@tika.apache.org <u...@tika.apache.org>
>>> *Cc: *Oscar Rieken Jr <oscar.rieke...@cofense.com>,
>>> corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
>>> *Subject: *Re: Datasets for testing large number of attachments
>>>
>>> What Nick said...
>>>
>>>
>>>
>>> cc_large is a sample of some of the larger documents from
>>> commoncrawl3_refetched.
>>>
>>>
>>>
>>> If you want to give your pipeline a workout, I also recommend using the
>>> MockParser that is available in the tika-core tests jar.  That allows you
>>> to instrument an OOM and timeouts and system exits and all sorts of other
>>> mayhem.  Obv, don't put the tika-core tests jar on your class path in
>>> production.  See the files in
>>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
>>> for examples of how to trigger bad behavior with the MockParser.
>>>
>>>
>>>
>>> On the corpora, as Nick said, let us know what you want and we can help
>>> you select files.
>>>
>>>
>>>
>>> Cheers,
>>>
>>>
>>>
>>>         Tim
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
>>>
>>> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
>>> > I am currently trying to validate our Tika setup and was looking for a
>>> > set of example data I could use
>>>
>>> If you want a small number of files of lots of different types, the test
>>> files in the Tika source tree will work. Main set are in
>>> tika-parsers/src/test/resources/test-documents/
>>>
>>> If you want a very large number of files, then the Tika Corpora collection
>>> is a good source. We have a few different collections, including stuff
>>> from common crawl, govdocs and bug trackers. If you can let us know what
>>> sort of file types and how many, we can suggest the best corpora
>>> collection.
>>>
>>> Nick
>>>
>>>
