All,

I think I have finished deleting the PDFs that were exactly 1 MB from commoncrawl3/ (those likely to have been truncated by Common Crawl).
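For anyone repeating this cleanup on a local copy, the size filter can be sketched roughly as below. This is only a sketch, not the script actually used: it assumes "exactly 1 MB" means 1,048,576 bytes (1 MiB), and the function names and the dry-run flag are made up for illustration. It does not filter on file extension, since the files in the corpus may not carry one.

```python
from pathlib import Path

# Assumption: "exactly 1 MB" is taken to mean 1 MiB (1,048,576 bytes).
TRUNCATION_SIZE = 1024 * 1024


def find_truncation_candidates(root, size=TRUNCATION_SIZE):
    """Return the sorted paths of regular files under `root` whose size is exactly `size` bytes."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == size
    )


def delete_candidates(paths, dry_run=True):
    """Delete the given files and return the paths handled.

    With dry_run=True (the default) nothing is deleted; the list only
    shows what would be removed.
    """
    handled = []
    for p in paths:
        if not dry_run:
            p.unlink()
        handled.append(p)
    return handled
```

Running find_truncation_candidates("commoncrawl3/") followed by delete_candidates(..., dry_run=True) first makes it possible to review the list before anything is removed.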
I also folded in 113k PDFs from
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
(on our server:
https://corpora.tika.apache.org/base/docs/cc-main-2021-31-pdf-untruncated/).

I reran the file identification process with 'file', Siegfried and Tika, and I've updated datasette to point to the new db:
https://corpora.tika.apache.org/datasette

If you'd like to download the full sqlite db, it is available here:
https://corpora.tika.apache.org/base/share/tika-mimes-20230714.db.gz

If you have access to the server, I put new file lists for "random 1.2 million", PDFs (all) and ms-office here:
/data1/tools/tika/batch/new-file-lists/

We are using this new random 1.2 million file list in our regression tests in preparation for the next release of Tika. Yay!

Many thanks to DARPA and the SafeDocs program for funding this work and the gathering of the 8 million PDFs in cc-main-2021-31-pdf-untruncated. Many thanks, again, to Common Crawl! And, of course, thank you, Maruan Sahyoun, for generously setting up and funding the corpora server!

Cheers,

Tim

On Sat, May 20, 2023 at 5:44 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:

> Hi,
>
> On 19.05.23 at 17:25, Tim Allison wrote:
> > All,
> >
> > Tilman Hausherr mentioned that we might want to update the
> > common-crawl pdfs in our regression corpus. This proposal leaves the
> > bugtracker PDFs as they are.
> >
> > For the CC-based PDFs, we could:
> >
> > 1) remove existing truncated pdfs
> >
> > 2) fold in newer untruncated PDFs from:
> > https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
> >
> > What do you think?
> I'm OK with that.
>
> Thanks for the effort!
>
> Andreas
>
> > Best,
> >
> > Tim