[COMPRESS] zip-based entry names/metadata data set available

2019-04-22 Thread Tim Allison
All, For some recent work on Apache Tika, I used commons-compress to extract entry names and metadata via a streaming read from roughly 500k zip-based files we have in Tika's regression corpus. I was happy to see we have some POI-generated files in there. :) I noticed some areas for improveme
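
For readers unfamiliar with the streaming read mentioned in this snippet, here is a minimal sketch of the idea. It is not the code used for the Tika corpus: it swaps in the JDK's `java.util.zip.ZipInputStream` for commons-compress so that it runs without dependencies, and the names `ZipEntryLister`, `listEntries` and `sampleZip` are hypothetical. It walks the local file headers in order and records each entry name with the number of uncompressed bytes read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryLister {

    /**
     * Streams through a zip and returns entry name -> uncompressed byte count,
     * in the order the entries appear in the stream. The central directory is
     * never consulted, so this also works on non-seekable input.
     */
    public static Map<String, Long> listEntries(InputStream in) throws IOException {
        Map<String, Long> entries = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                // Sizes may be unknown up front in a streaming read,
                // so count the bytes actually read for this entry.
                long bytesRead = 0;
                int n;
                while ((n = zis.read(buffer)) != -1) {
                    bytesRead += n;
                }
                entries.put(entry.getName(), bytesRead);
            }
        }
        return entries;
    }

    /** Builds a tiny two-entry zip in memory (illustrative data only). */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("docs/b.txt"));
            zos.write("hello, zip".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> entries = listEntries(new ByteArrayInputStream(sampleZip()));
        entries.forEach((name, size) -> System.out.println(name + "\t" + size));
    }
}
```

A streaming reader sees only what precedes each entry's data, so fields stored solely in the central directory are not available this way; that trade-off is inherent to the approach rather than specific to either library.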

[COMPRESS and Tika/PDFBox/POI] files from bug trackers

2020-02-14 Thread Tim Allison
All, I recently downloaded attachments from the following bug trackers: COMPRESS, TIKA, PDFBox, POI, OpenOffice, LibreOffice and Ghostscript: http://162.242.228.174/docs/bugtrackers/ I then unpacked/uncompressed all of the packaged/compressed files so: COMPRESS-115-1.zip is the second fil

[COMPRESS] Tika's regression corpus

2020-06-05 Thread Tim Allison
@Compress devs, We recently transitioned our VM to a new provider, and we're improving the ASF-itude of this project. We recently started a new email list for those interested in guiding and using the 2 TB of files that we've gathered so far. Please join corpora-...@tika.apache.org if you ha