RE: {EXTERNAL}[tesseract-ocr] Installing tessdata

Peter Kronenberg Thu, 28 Jan 2021 06:43:11 -0800

Thanks for those links.  I think what I’m looking for is a more practical 
understanding of some of the differences, rather than the technical details, 
which I don’t fully understand since I’m not a domain expert.


For instance, I understand that there are two types of models, the LSTM OCR 
engine and the legacy engine.  What is the practical difference between the 
two?  In other words, if I go with the ‘best’ or ‘fast’ models, which only do 
LSTM OCR, what am I missing out on by not having legacy?  Is there any reason I 
would stick with the legacy models (https://github.com/tesseract-ocr/tessdata)?

As for the difference between ‘fast’ and ‘best’, is there any quantitative 
comparison that someone can point me to?  In other words, how much better is 
‘best’, and how much more time does it take?  I’m trying to decide which one 
is best (no pun intended) for my application.

For the scripts, I haven’t found much definitive documentation.  If I use a 
script instead of a language, is that equivalent to specifying all the 
languages that use that script?  Is there any downside?  Do all the scripts 
contain English?  For example, if the language I’m dealing with is German, 
could I just specify Latin?  Or would it be more accurate to specify ‘deu’?  
For something like Arabic, if I specified a script of Arabic, would that 
include Arabic, Farsi and other similar languages that use the same alphabet?  
Would it be just as accurate as specifying the specific language?  And does 
the Arabic script contain English as well, so it could handle a mixed document?

Thank you
Peter

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf 
Of Shree Devi Kumar
Sent: Wednesday, January 27, 2021 8:41 PM
To: tesseract-ocr <tesseract-ocr@googlegroups.com>
Subject: Re: {EXTERNAL}[tesseract-ocr] Installing tessdata

Please see

https://tesseract-ocr.github.io/tessdoc/Data-Files.html

Also the readme files in the three repos

https://github.com/tesseract-ocr/tessdata_fast


On Thu, Jan 28, 2021, 03:20 Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
Hi, can someone help with these questions?  I’m just trying to better 
understand how the language models are used and how they differ.

Thanks
Peter

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf 
Of Peter Kronenberg
Sent: Thursday, January 21, 2021 12:59 PM
To: tesseract-ocr@googlegroups.com
Subject: {EXTERNAL}[tesseract-ocr] Installing tessdata

I see that the default tessdata just has English and OSD.  I see all the other 
data at https://github.com/tesseract-ocr/tessdata.  Do I just copy those to the 
same tessdata directory?  The repo has a much larger version of eng.traineddata 
than what comes with Tesseract.  Can I just replace it?
And how do the ones in the script directory differ?

In the directory from the initial install, not only do I have eng.traineddata, 
but there is also user-patterns, user-words and other files.  Do those files 
exist for the other languages as well?
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/MN2PR20MB268642993B65C83511CFAF88E7A19%40MN2PR20MB2686.namprd20.prod.outlook.com.
