Hi Simon,
If I understand correctly how tesseract works, it follows these steps:

- it segments the image into lines of text
- it then takes each individual line and slides a small window, 1px wide I
think, over it, from one end to the other. At each step the model outputs
a prediction. The model, being a bidirectional LSTM, has some memory of the
previous and following pixel columns.
- all these per-step predictions are converted into characters using beam
search (a toy decoding sketch follows below)
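
To make the decoding step concrete, here is a toy, greedy version of it
(merge repeated labels, drop blanks). Tesseract's real decoder is a beam
search that keeps several candidate paths; the labels below are made up
purely for illustration:

# Toy CTC-style collapse of per-column predictions into text.
# The LSTM emits one label per horizontal step; the decoder merges
# repeated labels and drops the "blank" symbol. A real beam search
# keeps several candidate paths; this greedy sketch keeps one.
BLANK = "_"

def greedy_ctc_decode(per_step_labels):
    out = []
    prev = None
    for label in per_step_labels:
        if label != BLANK and label != prev:
            out.append(label)
        prev = label
    return "".join(out)

# Hypothetical per-column argmax outputs for an image of "0,05":
steps = ["_", "0", "0", "_", ",", "_", "0", "_", "5", "5", "_"]
print(greedy_ctc_decode(steps))  # -> "0,05"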

Please correct me if I got this wrong. So the first thing I think about,
looking at your picture, is the segmentation step. Do you want to read the
"< 0,05 A" block only? Is the segmentation step able to isolate it? This
is the first thing I would try to understand.
Also, your sample image for "<" has a very different angle from the one
before 0,05.
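
To answer the segmentation question above, a quick sketch (assuming
Python with pytesseract and Pillow installed; the file name is a
placeholder) that prints every box tesseract's layout analysis finds,
so you can see whether the "< 0,05 A" block comes out as one line:

# Dump the boxes found by tesseract's layout analysis.
# Assumes pytesseract + Pillow; "diagram.png" is a placeholder name.
import pytesseract
from PIL import Image

img = Image.open("diagram.png")
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for i, text in enumerate(data["text"]):
    if text.strip():
        print(data["left"][i], data["top"][i],
              data["width"][i], data["height"][i], repr(text))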

In this case I would try to do a custom segmentation: look for
rectangular boxes of a certain height, aspect ratio, etc., then crop
these out (maybe dropping the rectangular box and the black vertical
lines) and feed them to tesseract. This of course requires custom
programming.
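
A rough sketch of that idea (assuming OpenCV 4.x and pytesseract; the
thresholds and file name are made-up guesses that would need tuning on
your drawings):

# Find box-like contours of plausible size, crop their interiors,
# and OCR each crop as a single text line (--psm 7).
# Assumes opencv-python (4.x API) and pytesseract; thresholds are guesses.
import cv2
import pytesseract

img = cv2.imread("diagram.png", cv2.IMREAD_GRAYSCALE)
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(bw, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Keep only frames that look like tolerance boxes: wide, fixed-ish height.
    if 30 < h < 100 and w > 2 * h:
        pad = 3  # drop the frame itself and a bit of the vertical lines
        crop = img[y + pad:y + h - pad, x + pad:x + w - pad]
        text = pytesseract.image_to_string(crop, config="--psm 7")
        print((x, y, w, h), text.strip())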

This might give good results even without fine tuning. I would try this
manually with GIMP first.


Also, I suppose you are not going to encounter a lot of wild fonts in
these kinds of diagrams. The more fonts you use, the harder the training.
I would focus on very few fonts, even one. I would start with exactly one
font and train on that, to see quickly whether my training setup/pipeline
is working, and whether the training results carry over to the diagrams
later. If the model's error rate is good on the individual text lines but
bad on the real images, it might be a segmentation problem that training
cannot fix. Or the problem might be the external box, which I suppose you
do not have in your generated data.
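
For the one-font sanity check, a possible starting point (a sketch only:
it assumes tesseract's text2image tool is on your PATH, and the font
name, paths and number formats are placeholders to adjust):

# Generate synthetic line images for a single font with text2image.
# Assumes the text2image binary from tesseract is installed; font name,
# fonts_dir and output paths are guesses to adapt.
import os
import random
import subprocess

# Training text in the formats mentioned in this thread: X.XXX, X.XX, X,XX, X,XXX
lines = []
for _ in range(200):
    sep = random.choice([".", ","])
    decimals = random.choice([2, 3])
    lines.append(f"{random.randint(0, 9)}{sep}{random.randint(0, 10**decimals - 1):0{decimals}d}")
with open("numbers.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

os.makedirs("data", exist_ok=True)
subprocess.run(["text2image",
                "--text=numbers.txt",
                "--outputbase=data/numbers_font0",
                "--font=DejaVu Sans",            # assumed single font
                "--fonts_dir=/usr/share/fonts"],
               check=True)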

Ideally, I would use real crops from these diagrams rather than images from
text2image.
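
If you already have such crops, you can turn them into tesstrain-style
ground truth: one line image plus a .gt.txt transcription with the same
base name. A minimal sketch (file names, labels and the output directory
are placeholders):

# Save real crops as tesstrain ground-truth pairs: each line image gets
# a matching .gt.txt file containing its transcription.
import os

samples = [("crop_0001.png", "< 0,05 A"),   # (image file, true text)
           ("crop_0002.png", "0,05")]
gt_dir = "data/numbers-ground-truth"        # tesstrain's expected layout
os.makedirs(gt_dir, exist_ok=True)

for i, (image_path, transcription) in enumerate(samples):
    base = os.path.join(gt_dir, f"line_{i:04d}")
    os.replace(image_path, base + ".png")   # move the crop into place
    with open(base + ".gt.txt", "w", encoding="utf-8") as f:
        f.write(transcription + "\n")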

Also, distinguishing 0 from O with many fonts is very hard. Often you have
domain knowledge that can help you fix these errors in post-processing:
for example, "0,O5" can easily be spotted and fixed. You can, for example,
assume that each box contains only one kind of data and guess the most
likely reading from that, or from the box sequence, etc.
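
A trivial sketch of that kind of fix, for a field known to be numeric
(the confusion table is just an example):

# Map letters that OCR commonly confuses with digits back to digits,
# but only when the result then matches the expected numeric format.
import re

CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

def fix_numeric_field(text):
    candidate = text.translate(CONFUSIONS)
    if re.fullmatch(r"\d+[.,]\d+", candidate):  # e.g. 0,05 or 1.250
        return candidate
    return text  # leave it alone if it still doesn't look numeric

print(fix_numeric_field("0,O5"))  # -> "0,05"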

I got good results with 20k samples (real-world scanned docs, multiple
fonts), so 10k seems reasonable. I also assume your output character set
is very small: the digits, a few capital letters and a couple of symbols
(no %, ^, &, {, etc.).
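
If so, you can also tell tesseract about it at recognition time with a
character whitelist (note that tessedit_char_whitelist is fully honored
by the legacy engine, while support with the LSTM engine depends on the
tesseract version, so verify on your installation; the character set
below is a guess):

# Restrict recognition to the characters you actually expect.
import pytesseract
from PIL import Image

whitelist = "0123456789.,<A"  # assumed character set, adjust to your data
config = f"--psm 7 -c tessedit_char_whitelist={whitelist}"
print(pytesseract.image_to_string(Image.open("crop_0001.png"), config=config))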



Lorenzo

On Thu, 23 Nov 2023 at 10:16, Simon <[email protected]> wrote:

> If I need to train new characters that are not recognized by a default
> model, is fine tuning in this case the right approach?
> One of these characters is the one for angularity: ∠
>
> This symbol appears in technical drawings and should be recognized
> there. E.g. for the scenario in the following picture, tesseract should
> recognize this symbol.
>
> [image: angularity.png]
>
> Also here is one of the pngs I tried to train with:
> [image: angularity_0_r0.jpg]
> They all look pretty similar to this one. Things that change are the
> angle, the proportion and the thickness of the lines. All examples have
> this 64x64 pixel box around them.
>
>
> Is fine tuning the right approach for this scenario? I only find
> information about fine tuning for specific fonts. Also, for fine tuning
> the "tesstrain" repository would not be needed, as it is used for
> training from scratch, correct?
> [email protected] schrieb am Mittwoch, 22. November 2023 um 15:27:02
> UTC+1:
>
>> From my limited experience, you need a lot more data than that to train
>> from scratch. If you can't make more data than that, you might first try
>> to fine tune, and then train by removing the top layer of the best model.
>>
>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 [email protected]
>> wrote:
>>
>>> As it is not really possible to combine my from-scratch traineddata
>>> with an existing one, I have decided to also train my traineddata model
>>> on numbers. Therefore I wrote a script which synthetically generates
>>> ground truth data with text2image.
>>> This script uses dozens of different fonts and creates numbers in the
>>> following formats:
>>> X.XXX
>>> X.XX
>>> X,XX
>>> X,XXX
>>> I generated 10,000 files to train the numbers. But unfortunately the
>>> numbers get recognized pretty poorly with the best model (most of the
>>> time only "0.", "0" or "0," gets recognized).
>>> So I wanted to ask whether this is not enough training (ground truth)
>>> data for proper recognition when I train several fonts.
>>> Thanks in advance for your help.
>>>
