I've updated the process here:

https://cwiki.apache.org/confluence/display/TIKA/TikaEvalOnVM

One of the key missing pieces was the batch-scripts.tgz file.  Apparently,
that attachment never made it through the Confluence migration.  I was
able to reconstruct it from my email sent folder. :(  Tilman noted the
missing attachment a long, long time ago, and I finally got around to
fixing it.  Sorry it took me so long.

I've kicked off the process with the most recent version of the PDFBox 2.x
branch and with a bug fix for the problem uncovered in Tika in the last run.

For anyone with access to the VM who wants to give this process a try,
please do try it out after the current run finishes.  If you only want to
test a few hundred files, just shorten the fileList...see instructions. :D

Again, I'm sorry for not taking care of this before I went on leave.  Let
me know how I can improve the documentation or anything else.

Cheers,

           Tim

On Mon, Aug 10, 2020 at 9:47 AM Tim Allison <talli...@apache.org> wrote:

> Working on this now.  Will post update when documentation is ready.
>
> On Wed, Aug 5, 2020 at 3:30 PM Andreas Lehmkuehler <andr...@lehmi.de>
> wrote:
>
>> Am 05.08.20 um 17:19 schrieb Tim Allison:
>> > Y, that's pretty close.
>> >
>> > Unfortunately, I'm away from my dev environment and can't access the VM
>> > to confirm.  I don't think I got the list of PDF files into a location
>> > where you can see it. With enough permissions, you should be able to
>> > see it in /data/work (???)...argh.
>> I have read access to the whole corpus of files. I already compiled two
>> Tika versions to be used for the comparison. Unfortunately, I wasn't able
>> to run it as described at [1]. An exception occurred after some time and
>> I gave up.
>>
>> > I'm sorry for not getting things in order before I left.  I'll be back
>> > on Monday. :(
>> No need to worry, we didn't agree on any deadline for anything, so
>> everything is fine.
>>
>> It would be great if you could rerun the tests (2.0.20 vs 2.0.21-SNAPSHOT);
>> maybe we can use your setup as a template.
>>
>> Thanks in advance
>>
>> Andreas
>>
>> [1] https://cwiki.apache.org/confluence/display/TIKA/TikaEval
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Aug 2, 2020 at 1:20 PM Andreas Lehmkuehler <andr...@lehmi.de>
>> > wrote:
>> >
>> >> Am 02.08.20 um 15:26 schrieb Maruan Sahyoun:
>> >>>
>> >>> Hi Andreas,
>> >>>
>> >>> I'll add you as a user. Details as pm.
>> >> Access works, thanks Maruan!
>> >>
>> >> @Tim Is [1] still valid documentation for the regression test run?
>> >>
>> >> [1] https://cwiki.apache.org/confluence/display/tika/TikaEvalOnVM
>> >>
>> >>>
>> >>> BR
>> >>> Maruan
>> >>>
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I'd like to get access to the corpora server to run the regression
>> >>>> tests for PDFBox on my own, so that we don't have to bother Tim every
>> >>>> time we want to cut a new release. Furthermore, I'd like to run some
>> >>>> 2.0.x vs trunk tests in the future, and it'd be handy to do that myself.
>> >>>>
>> >>>> What do I have to do to get access?
>> >>>>
>> >>>> Is there any documentation on how to configure the regression test
>> >>>> runner, or is it possible to simply copy and modify an existing
>> >>>> installation?
>> >>>>
>> >>>>
>> >>>> Cheers
>> >>>> Andreas
>> >>
>> >>
>> >
>>
>>
