Hi!

Upstream (who is not overly excited about the idea of supporting random git snapshots of Tesseract) speaking here.

* Helmut Grohne <[email protected]>, 2018-01-02, 13:47:
But for the new tesseract the output is:

   Error opening data file /usr/share/tesseract-ocr/4.00/nonexistent.traineddata
   Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
   Failed loading language 'nonexistent'
   Tesseract couldn't load any languages!
   Could not initialize tesseract.

Note in particular that the error message lacks the tessdata subdirectory.

The commit that introduced this change seems to be:
https://github.com/tesseract-ocr/tesseract/commit/1cc511188d980a33742d2699f9927ed1c84e81de
(grep for "Try without tessdata")
The commit message doesn't explain why it was made. There's no changelog entry for it either. Yay...

Anyway, I've implemented a workaround in ocrodjvu:
https://github.com/jwilk/ocrodjvu/commit/b41f643d82f544cc15660e0d3292e31136e3d37b

In the long run, ocrodjvu should switch to using the --list-langs option. But this is currently super slow for some reason:

  $ time tesseract --list-langs > /dev/null

  real  0m0.367s
  user  0m0.333s
  sys   0m0.032s
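
For illustration only, here is a minimal Python sketch of how a caller might check for a language via --list-langs instead of parsing error messages. The function names are hypothetical and not taken from ocrodjvu. The sketch assumes the output consists of a header line ending in a colon followed by one language code per line, and it merges stderr into stdout on the assumption that the stream carrying the list may differ between Tesseract versions.

```python
import subprocess

def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Assumption: the header line ends with a colon,
        # e.g. "List of available languages (2):".
        if not line or line.endswith(':'):
            continue
        langs.append(line)
    return langs

def list_languages(executable='tesseract'):
    """Run the executable and return the languages it reports (hypothetical helper)."""
    proc = subprocess.run(
        [executable, '--list-langs'],
        stdout=subprocess.PIPE,
        # Merge stderr into stdout; which stream carries the list is
        # assumed to vary between versions.
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=True,
    )
    return parse_list_langs(proc.stdout)
```

A caller could then test `'nonexistent' in list_languages()` once and cache the result, which would avoid paying the start-up cost measured above for every check.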

--
Jakub Wilk
