You can have a look at ocrd-train
https://github.com/OCR-D/ocrd-train
You just have to prepare cropped TIFF images and txt files with the same
name, each pair covering a single line of text.
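
To make the pairing concrete, here is a rough sketch of a sanity check you could run over a folder before training. It is only an illustration: the function name find_unpaired is made up, and the default extensions (.tif for images, .gt.txt for transcriptions) are my assumption about what ocrd-train expects, so check the project README and adjust them.

```python
from pathlib import Path


def find_unpaired(directory, image_ext=".tif", text_ext=".gt.txt"):
    """Check that every line image has a one-line transcription.

    Returns three sorted lists of base names:
    images without a text file, text files without an image,
    and pairs whose text file does not hold exactly one line.
    The extensions are assumptions; pass your own if they differ.
    """
    root = Path(directory)
    names = [p.name for p in root.iterdir() if p.is_file()]

    # Strip the extension to get the shared base name of each pair.
    images = {n[: -len(image_ext)] for n in names if n.endswith(image_ext)}
    texts = {n[: -len(text_ext)] for n in names if n.endswith(text_ext)}

    missing_text = sorted(images - texts)
    missing_image = sorted(texts - images)

    bad_line_count = []
    for stem in sorted(images & texts):
        content = (root / (stem + text_ext)).read_text(encoding="utf-8")
        # Ignore a trailing newline, then require exactly one line.
        if len(content.strip().splitlines()) != 1:
            bad_line_count.append(stem)

    return missing_text, missing_image, bad_line_count
```

Call it on your ground-truth folder and fix whatever shows up in the three lists; if all three are empty, every image has a matching single-line text file.
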
At the same time, if you already set up everything for the font based
training, I'd give it a try (time permitting): you
Is there a guide somewhere on how to set up training like this? How to pair
the images and text, etc.? And thank you for the insight, it really is helpful.
On Thursday, January 31, 2019 at 11:18:35 AM UTC+1, Lorenzo Blz wrote:
>
> Yes, generating text is faster and easier.
>
> But the real extracted
Yes, generating text is faster and easier.
But the real extracted and cleaned text you are going to eventually
recognize is going to be different from this, more or less depending on a
lot of factors:
- how similar your training font actually is
- how good your cleanup will be (test this in advanc
Oh, and one more thing: the same card with the same name can appear in
different editions of Magic, so pure recognition by name is not enough. I'm
also training my software to recognize the edition of the card by
different means, so all that in combination should be quite enough.
On Wedne
I'm not sure how exactly I would set that up (regarding tesseract training),
BUT there are about 44000 (English) cards at this time and a high
resolution image of each is about 2 megs (at least from the resource I can
get them from). Also, not every card has the same format, so a generic crop
functi
If you have images of the cards with the corresponding text you could train
it on the cropped/cleaned text directly.
On Wed, Jan 30, 2019 at 15:41, Daniel Ferenc
wrote:
> So, I have figured out what I was doing wrong:
>
> - I am using tesseract packages I got from apt on ubuntu 18
So, I have figured out what I was doing wrong:
- I am using tesseract packages I got from apt on Ubuntu 18.04 LTS and they
were obviously missing some langdata, which I downloaded from the repository
- There was also a need to get the Latin.unicharset file
- And finally I didn't notice an error i
Fine-tune with your specific font. See e.g. the example below, which uses the Impact font.
#!/bin/bash
time ~/tesseract/src/training/tesstrain.sh \
--fonts_dir /usr/share/fonts \
--lang eng --linedata_only \
--noextract_font_properties \
--langdata_dir ~/langdata \
--tessdata_dir ~/tessdata \
--fontlis
Hi,
I need to train Tesseract for only a specific wordlist (about 13600 words)
and one specific font. I tried following the training tutorial on the Wiki
but I'm not sure if I'm doing something wrong: the traineddata file is
about 7 megabytes and I combined it with the eng.traineddata to get an