Hi,

On 19.05.23 at 17:25, Tim Allison wrote:
> All,
>
>    Tilman Hausherr mentioned that we might want to update the
> common-crawl pdfs in our regression corpus.  This proposal leaves the
> bugtracker PDFs as they are.
>
> For the CC-based PDFs, we could:
>
> 1) remove existing truncated pdfs
>
> 2) fold in newer untruncated PDFs from:
> https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
>
> What do you think?
I'm OK with that.

Thanks for the effort!

Andreas


> Best,
>
>        Tim
