I've updated the process here: https://cwiki.apache.org/confluence/display/TIKA/TikaEvalOnVM
One of the key missing pieces was the batch-scripts.tgz file. Apparently, that attached file never made it during the Confluence migration. I was able to reconstruct it from my email's sent box. :(

Tilman noted the missing attachment a long, long time ago, and I finally got around to fixing it. Sorry it took me so long.

I've kicked off the process with the most recent version of the PDFBox 2.x branch and with a bug fix for the problem uncovered in Tika in the last run.

For anyone with access to the VM who wants to give this process a try, please do try it out after the current run finishes. If you only want to test a few hundred files, just shorten the fileList...see instructions. :D

Again, I'm sorry for not taking care of this before I went on leave. Let me know how I can improve the documentation or anything else.

Cheers,

Tim

On Mon, Aug 10, 2020 at 9:47 AM Tim Allison <talli...@apache.org> wrote:
> Working on this now. Will post update when documentation is ready.
>
> On Wed, Aug 5, 2020 at 3:30 PM Andreas Lehmkuehler <andr...@lehmi.de>
> wrote:
>
>> On 05.08.20 at 17:19, Tim Allison wrote:
>> > Y, that's pretty close.
>> >
>> > Unfortunately, I'm away from my dev environment and can't access the VM to
>> > confirm. I don't think I got the list of PDF files into a location where
>> > you can see it. With enough permissions, you should be able to see it in
>> > /data/work (???)...argh.
>> I have read access to the whole corpus of files. I already compiled two Tika
>> versions to be used for the comparison. Unfortunately I wasn't able to run it as
>> described at [1]. An exception occurred after some time and I gave up.
>>
>> > I'm sorry for not getting things in order before I left. I'll be back on
>> > Monday. :(
>> No need to worry, we didn't agree on any deadline for anything, so everything is
>> fine.
>>
>> It would be cool if you rerun the tests (2.0.20 vs 2.0.21-SNAPSHOT) and maybe we
>> can use your setup as template or so.
>>
>> Thanks in advance
>>
>> Andreas
>>
>> [1] https://cwiki.apache.org/confluence/display/TIKA/TikaEval
>> >
>> > On Sun, Aug 2, 2020 at 1:20 PM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
>> >
>> >> On 02.08.20 at 15:26, Maruan Sahyoun wrote:
>> >>>
>> >>> Hi Andreas,
>> >>>
>> >>> I'll add you as a user. Details as pm.
>> >> Access works, thanks Maruan!
>> >>
>> >> @Tim Is [1] still valid documentation for the regression test run?
>> >>
>> >> [1] https://cwiki.apache.org/confluence/display/tika/TikaEvalOnVM
>> >>
>> >>>
>> >>> BR
>> >>> Maruan
>> >>>
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I'd like to get access to the corpora server to run the regression tests for
>> >>>> PDFBox on my own, so that we don't have to bother Tim every time we want to cut
>> >>>> a new release. Furthermore I'd like to run some 2.0.x vs trunk tests in the
>> >>>> future and it'd be handy to do that myself.
>> >>>>
>> >>>> What do I have to do to get access?
>> >>>>
>> >>>> Is there any documentation on how to configure the regression test runner, or
>> >>>> is it possible to simply copy and modify an existing installation?
>> >>>>
>> >>>>
>> >>>> Cheers
>> >>>> Andreas