[tesseract-ocr] Re: Improving results based on manually transcribed pages

Max Richey Sat, 04 Sep 2021 02:16:07 -0700

Like you, I found most of the online help about training for a single font 
to be no help at all.

So, please know that it is easy to go even further down the rabbit hole 
than you are now.  

I have successfully trained a Tesseract 5.0 LSTM (not to be confused with 
LDS, lol) language model on a single font.  But I did it from scratch 
because the font and manuscript were so unique and included many ligatures. 
I also had to develop the font first and in three scripts (Greek, Hebrew, 
Latin).  And I am just now scanning and correcting the source images.

Along with learning to code in Python to prepare the images, I used the 
repository here:

https://github.com/tesseract-ocr/tesstrain

You may be in a good place to use this method, too, except that you would 
need to designate a pre-existing model to start training from, which I'm 
sure you have.
You also need a good quad-core computer with Tesseract and Python 
installed.  Any remaining prerequisites are listed on the repo site.
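
As a rough sketch of what starting from a pre-existing model looks like, the 
snippet below only builds the command line for the tesstrain Makefile.  The 
variable names (MODEL_NAME, START_MODEL, TESSDATA, MAX_ITERATIONS) are the 
ones I remember from the tesstrain README, so please check them against the 
repo.  The model name "mybook" and the tessdata path are made up for 
illustration.

```python
import shlex

def tesstrain_command(model_name, start_model, tessdata_dir, max_iterations):
    """Build the `make training` argument list for fine-tuning from start_model."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",          # hypothetical name for the new model
        f"START_MODEL={start_model}",        # existing model to continue training from
        f"TESSDATA={tessdata_dir}",          # folder holding <start_model>.traineddata
        f"MAX_ITERATIONS={max_iterations}",
    ]

# Assumed values; adjust to your own setup.
cmd = tesstrain_command("mybook", "eng", "/usr/share/tessdata", 20000)
print(shlex.join(cmd))
```

To actually run it, something like subprocess.run(cmd, cwd=<your tesstrain 
checkout>, check=True) should do.  As far as I know, the starting 
.traineddata has to be the tessdata_best (float) version, not the fast one.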

I would have to research it to know for sure if this would work in your 
case.  And I am up to my neck in my own project. 
I have, however, developed some Python tools that might be helpful.  They 
would split the image and text files into lines with matching names.
But there are many ways to skin those two cats. And you probably won't have 
to learn to code as I did. There may also be others that will post more 
insights.
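
To give an idea of the text half of that job, here is a minimal sketch (not 
my actual tool).  It assumes each non-empty line of a page transcription 
corresponds to one cropped line image, and that the files use the 
"<name>.gt.txt" naming that tesstrain expects.  The function name and the 
"page0001_001" numbering scheme are invented for the example.

```python
def line_ground_truth(page_stem, page_text):
    """Return (filename, content) pairs, one per non-empty transcription line."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    return [
        (f"{page_stem}_{i:03d}.gt.txt", ln + "\n")
        for i, ln in enumerate(lines, start=1)
    ]

# Two text lines on the page give two one-line files.
pairs = line_ground_truth("page0001", "First line of text\n\nSecond line of text\n")
for name, content in pairs:
    print(name, repr(content))
```

Each pair can then be written out with pathlib.Path(out_dir, name).write_text(
content, encoding="utf-8"), and the matching cropped line image would be 
saved as page0001_001.tif, page0001_002.tif, and so on.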

Training is accomplished by providing ground truth .tif files that are the 
cropped, individual lines from the source image.
Each line image must also have a matching one-line text file with the same 
base name. Filename matching is critical.  You need a lot of these pairs.  
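
Since a single mismatched name can spoil a run, a quick check before 
training is worthwhile.  This is a small sketch, assuming the pairs are named 
"<name>.tif" and "<name>.gt.txt" as tesstrain expects; the function name is 
my own.

```python
def unmatched_ground_truth(filenames, image_ext=".tif", text_ext=".gt.txt"):
    """Return (images without text, texts without image) as sorted base names."""
    images = {n[:-len(image_ext)] for n in filenames if n.endswith(image_ext)}
    texts = {n[:-len(text_ext)] for n in filenames if n.endswith(text_ext)}
    return sorted(images - texts), sorted(texts - images)

# Example listing: line 002 has no transcription, line 003 has no image.
listing = [
    "page0001_001.tif", "page0001_001.gt.txt",
    "page0001_002.tif",
    "page0001_003.gt.txt",
]
missing_text, missing_image = unmatched_ground_truth(listing)
print("images without text:", missing_text)
print("texts without image:", missing_image)
```

On a real folder, pass os.listdir(<your ground truth directory>) instead of 
the example list.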

More is better.
I started with 440 lines of ground truth at 20,000 iterations (several 
hours duration).  
Using the scanned images I now have, I am still re-training at 250K 
iterations (2 days) with 7 times that number of lines.
I have achieved a character accuracy of 99.3% and a word accuracy of 
96.5%.  Case sensitivity is very good, too.
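
For anyone wanting to measure this on their own pages, one common definition 
is 1 minus the edit distance divided by the length of the correct text, 
counted over characters or over words.  The sketch below uses that 
definition; it is an assumption on my part that it matches how any 
particular tool reports accuracy.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]

def accuracy(ref, hyp):
    """1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

truth = "Record of the meeting"   # corrected transcription
ocr = "record of the meeting"     # OCR output with one case error
char_acc = accuracy(truth, ocr)
word_acc = accuracy(truth.split(), ocr.split())
print(f"character accuracy: {char_acc:.3f}, word accuracy: {word_acc:.3f}")
```

A single wrong character costs one word out of four here but only one 
character out of 21, which is why word accuracy always comes out lower than 
character accuracy.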

It was worth the effort, but like you, I must still manually correct the 
OCR scanned results.  
Now, I am only correcting about 2-4% instead of 100% like you had to.

The rough stuff?   I am 5 years into it, and I still have a lot of hope that 
the effort will be historical.  I can't stop now.
Perhaps you won't, either.  Hang in there.  Remember, it's hard to kick 
against the goads.

Maybe this will help you.

Max

On Friday, September 3, 2021 at 3:33:47 AM UTC-7 thisisthe...@gmail.com 
wrote:

> Hi all
>
> I have approximately 13 variations of the same book to run OCR on.
>
> I have done the first variation and manually corrected all errors but, at 
> nearly 600 pages for only the first book, this process has taken too long.
>
> The font is the same throughout the variations, so I'd like to know how I 
> can use the current English train files and use my scans + text files to 
> improve accuracy.
>
> This is the kind of quality of the images I have manually checked and want 
> to use to improve accuracy
>
> [image: Source.png]
>
> And the subsequent images I need to OCR are mostly like this 
> [image: Source2.png]
> Note there are slight differences in the text, for example, "Record" and 
> "record" - which is the purpose of my project.
>
> Can anyone recommend an article or video for training a new font? Or 
> perhaps someone might be willing to help me with this for a payment?
>
> Thanks

