Hi,

On 19.05.23 at 17:25, Tim Allison wrote:
All,

Tilman Hausherr mentioned that we might want to update the Common Crawl PDFs in our regression corpus. This proposal leaves the bugtracker PDFs as they are. For the CC-based PDFs, we could:

1) remove the existing truncated PDFs
2) fold in newer untruncated PDFs from: https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

What do you think?
I'm OK with that. Thanks for the effort!

Andreas
Best, Tim