Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Lorenzo Bolzani
You can have a look at ocrd-train (https://github.com/OCR-D/ocrd-train). You just have to prepare cropped TIFF images and txt files with the same name, each containing a single line of text. At the same time, if you already set up everything for the font based training, I'd give it a try (time permitting): you
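
As a rough illustration of the "same name" pairing described above, here is a minimal Python sketch that checks a folder of line images against their text files before training. The function name `pair_ground_truth` and the `.tif`/`.txt` extensions are assumptions made for this example, not something ocrd-train prescribes here; check the ocrd-train README for the exact file names it expects.

```python
from pathlib import Path


def pair_ground_truth(data_dir, image_ext=".tif", text_ext=".txt"):
    """Pair each line image with the text file that shares its base name.

    Returns (pairs, problems): pairs is a list of (image_path, text) tuples,
    problems is a list of human-readable messages for files to fix.
    The extensions are assumptions; adjust them to your training setup.
    """
    data_dir = Path(data_dir)
    pairs, problems = [], []
    for image in sorted(data_dir.glob(f"*{image_ext}")):
        text_file = image.with_suffix(text_ext)
        if not text_file.exists():
            problems.append(f"{image.name}: no matching {text_file.name}")
            continue
        # Ignore blank lines, then require exactly one line of text.
        lines = [ln for ln in text_file.read_text(encoding="utf-8").splitlines()
                 if ln.strip()]
        if len(lines) != 1:
            problems.append(
                f"{text_file.name}: expected 1 line of text, found {len(lines)}")
            continue
        pairs.append((image, lines[0].strip()))
    return pairs, problems
```

Calling `pair_ground_truth("my_lines")` on a (hypothetical) folder of cropped card-name images returns the usable pairs plus a list of images that still need a transcription or have more than one line.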

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Daniel Ferenc
Is there a guide somewhere on how to set up training like this? How to pair the images and text, etc.? And thank you for the insight, it really is helpful. On Thursday, January 31, 2019 at 11:18:35 AM UTC+1, Lorenzo Blz wrote: > > Yes, generating text is faster and easier. > > But the real extracted

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Lorenzo Bolzani
Yes, generating text is faster and easier. But the real extracted and cleaned text you are going to eventually recognize is going to be different from this, more or less depending on a lot of factors: - how similar your training font actually is - how good your cleanup will be (test this in advanc

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Daniel Ferenc
Oh, and one more thing - the same card with the same name can appear in different editions of Magic, so pure recognition by name is not enough, I'm also training my software to recognize the edition of the card by using different means so all that in combination should be quite enough. On Wedne

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Daniel Ferenc
I'm not sure how exactly I would set that up (regarding Tesseract training), BUT there are about 44000 (English) cards at this time and a high resolution image of each is about 2 megs (at least from the resource I can get them from). Also, not every card has the same format so a generic crop functi

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Lorenzo Bolzani
If you have images of the cards with the corresponding text you could train it on the cropped/cleaned text directly. On Wed, Jan 30, 2019 at 15:41, Daniel Ferenc wrote: > So, I have figured out what was I doing wrong: > > - I am using tesseract packages I got from apt on ubuntu 18

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Daniel Ferenc
So, I have figured out what I was doing wrong: - I am using the tesseract packages I got from apt on Ubuntu 18.04 LTS and they were obviously missing some langdata, which I downloaded from the repository - There was also a need to get the Latin.unicharset file - And finally I didn't notice an error i

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-29 Thread Shree Devi Kumar
Finetune with your specific font - see e.g. below, which uses the IMPACT font.

#!/bin/bash
time ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang eng --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata \
  --fontlis

[tesseract-ocr] Training for a specific wordlist and font

2019-01-28 Thread Daniel Ferenc
Hi, I need to train Tesseract for only a specific wordlist (about 13600 words) and one specific font. I tried following the training tutorial on the Wiki but I'm not sure if I'm doing anything wrong - the traineddata file is about 7 megabytes and I combined it with the eng.traineddata to get an
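
For font-based training from a fixed wordlist, one possible preparation step is to turn the list into lines of training text that can then be rendered in the target font. The sketch below is only an illustration under that assumption: `build_training_text`, `words_per_line` and `repeats` are hypothetical names chosen for this example and are not part of Tesseract or its training scripts.

```python
import random


def build_training_text(words, words_per_line=8, repeats=3, seed=0):
    """Pack a wordlist into shuffled lines of text.

    Every word appears exactly `repeats` times overall, each time in a
    different random context. The fixed seed keeps the output reproducible.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(repeats):
        shuffled = list(words)
        rng.shuffle(shuffled)
        # Cut the shuffled list into chunks of at most words_per_line words.
        for i in range(0, len(shuffled), words_per_line):
            lines.append(" ".join(shuffled[i:i + words_per_line]))
    return lines
```

The resulting lines can be written to a plain text file, one per line, and used as the training text for whichever training setup you follow. Shuffling several times is a design choice to vary which words sit next to each other; whether that helps for isolated card names would need to be tested.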