8 million PDFs (8 TB) from one month of Common Crawl. We refetched the
~2 million files that were truncated.

Zips of PDFs are available here:
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

A writeup by Peter Wyatt (PDF Association) is here:
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/

