8 million PDFs (8 TB) from one month of Common Crawl. We refetched the ~2 million files that were truncated.
Zips of the PDFs are available here:
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

The writeup by Peter Wyatt (PDF Association) is here:
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/