All,

On a dev branch, I replaced Optimaize with a dev version of OpenNLP's language detector, and I updated the common tokens list to cover the 120 langs covered by a dev version of OpenNLP's language model. I changed the min token length for common words to 3 (from 4), and I'm now using 30k common tokens per lang rather than 20k.
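
To make the two parameters above concrete, here is a rough sketch of how a per-language common tokens list could be picked from token counts. This is not the tika-eval code: the class name CommonTokenSelector, the method select, and the alphabetical tie-breaking are all hypothetical and only illustrate the idea of a minimum token length (3) and a per-language cap (30k).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of tika-eval: selects the "common tokens"
// for a single language from a token -> count map.
final class CommonTokenSelector {
    private CommonTokenSelector() {}

    /**
     * Returns up to maxTokens tokens, most frequent first, skipping tokens
     * shorter than minLength code points. Ties on count are broken
     * alphabetically (an assumption, made only to keep the output deterministic).
     */
    static List<String> select(Map<String, Long> counts, int minLength, int maxTokens) {
        List<Map.Entry<String, Long>> kept = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String token = e.getKey();
            // Count code points rather than chars so supplementary characters count once.
            if (token.codePointCount(0, token.length()) >= minLength) {
                kept.add(e);
            }
        }
        // Highest count first, then alphabetical.
        kept.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : kept) {
            if (result.size() >= maxTokens) {
                break;
            }
            result.add(e.getKey());
        }
        return result;
    }
}
```

With the settings described above, the call would look like CommonTokenSelector.select(counts, 3, 30000); the old settings would correspond to select(counts, 4, 20000).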

I reran this dev version of tika-eval on PDFBox 2.0.15 vs 2.0.16-SNAPSHOT, and the results are here:

http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz

Are there any critical problems with the updates in the contents comparison files? Any improvements?

I notice that 'cmn' is the most common category for 'not much actual text'... we may want to require a higher confidence in language detection before reporting a detected language...

Any and all recommendations are welcomed!

Thank you!

Cheers,

Tim

On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
>
> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> > On 12.06.2019 at 03:56, Tim Allison wrote:
> >> Reports are available here for 2.0.16-SNAPSHOT:
> >>
> >> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>
> >> I haven't had a chance to look yet...
> >
> >
> > I did... It's not looking good. It's probably the change in the
> > ToUnicode stream parsing, I'll investigate this.
> I'm going to have a look
>
> Andreas
> >
> > Tilman
> >
> >
> >
> >>
> >> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <talli...@apache.org> wrote:
> >>> +1
> >>>
> >>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
> >>>> Hi,
> >>>>
> >>>> looks like it's time for the next release. How about cutting 2.0.16
> >>>> in about 2 weeks from now?
> >>>>
> >>>> WDYT?
> >>>>
> >>>> Andreas
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org