All,

I think I've finished deleting the PDFs in commoncrawl3/ that were exactly
1 MB (those likely to have been truncated by Common Crawl).
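
For anyone who wants to repeat the size check on their own copy, here is a
minimal sketch rather than the script I actually ran. It assumes "1 MB"
means exactly 1,048,576 bytes, and the function name find_exact_size_files
is made up for illustration. It only lists candidates and deletes nothing.

```python
import os
from pathlib import Path

# Assumption: "exactly 1 MB" means 1,048,576 bytes (1 MiB).
ONE_MIB = 1024 * 1024


def find_exact_size_files(root, size=ONE_MIB):
    """Return a sorted list of files under root whose size is exactly `size` bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_size == size:
                    hits.append(path)
            except OSError:
                # Skip files that vanish or cannot be stat'ed.
                continue
    return sorted(hits)
```

Reviewing the returned list before removing anything seems prudent, since a
file can legitimately be exactly that size without having been truncated.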

I also folded in 113k PDFs from
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
(on our server:
https://corpora.tika.apache.org/base/docs/cc-main-2021-31-pdf-untruncated/)

I reran the process for file identification with 'file', Siegfried and
Tika, and I've updated datasette to point to the new db:
https://corpora.tika.apache.org/datasette

If you'd like to download the full sqlite db, it is available here:
https://corpora.tika.apache.org/base/share/tika-mimes-20230714.db.gz
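
Once downloaded, the file can be unpacked and inspected with nothing but the
Python standard library. This is a rough sketch: the helper names gunzip and
list_tables are made up for illustration, and it makes no assumptions about
the schema beyond asking sqlite itself which tables exist.

```python
import gzip
import shutil
import sqlite3


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst` and return `dst`."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def list_tables(db_path):
    """Return the names of all tables in the sqlite db, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

For example, list_tables(gunzip("tika-mimes-20230714.db.gz",
"tika-mimes-20230714.db")) should print the table names to start exploring
from.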

If you have access to the server, I put new file lists for "random 1.2
million", PDFs (all) and ms-office here:
/data1/tools/tika/batch/new-file-lists/

We are using this new random 1.2 million file list in our regression tests
in preparation for the next release of Tika.  Yay!

Many thanks to DARPA and the SafeDocs program for funding this work and the
gathering of the 8 million PDFs in cc-main-2021-31-pdf-untruncated.  Many
thanks, again, to Common Crawl!

And, of course, thank you, Maruan Sahyoun, for generously setting up and
funding the corpora server!

Cheers,

        Tim

On Sat, May 20, 2023 at 5:44 AM Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

> Hi,
>
> Am 19.05.23 um 17:25 schrieb Tim Allison:
> > All,
> >
> >    Tilman Hausherr mentioned that we might want to update the
> > common-crawl pdfs in our regression corpus.  This proposal leaves the
> > bugtracker PDFs as they are.
> >
> > For the CC-based PDFs, we could:
> >
> > 1) remove existing truncated pdfs
> >
> > 2) fold in newer untruncated PDFs from:
> >
> > https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
> >
> > What do you think?
> I'm OK with that.
>
> Thanks for the effort!
>
> Andreas
>
> >
> > Best,
> >
> >        Tim
>
>
