[tesseract-ocr] Question about training data and psm

Neil Du Toit Wed, 01 Dec 2021 03:07:49 -0800

Hey

I've got a simple question, and then I'll provide more context. Can I 
fine-tune Tesseract using image/text pairs where each pair contains only a 
single word?


My understanding is that training happens on "line-level" data (which is 
how tesstrain describes it). That rules out multi-line input, but it 
doesn't necessarily rule out single words. However, I suspect that if 
training expects a full line of text, then feeding in single words might 
yield bad results?

It looks like tesstrain allows you to set the training psm, but does that 
change anything if training is always done on line-level data?
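
For reference, this is roughly how I understand the psm would be passed to 
tesstrain. It is only a sketch: I am assuming the tesstrain Makefile 
variables MODEL_NAME, START_MODEL and PSM, and the model name 
"word-finetune" is made up for illustration. In Tesseract's page 
segmentation modes, 8 means "single word" and 13 means "raw line".

```python
import shlex


def tesstrain_command(model_name, start_model, psm):
    """Build the argument list for a tesstrain fine-tuning run.

    Assumes the tesstrain Makefile variables MODEL_NAME, START_MODEL and PSM.
    Nothing is executed here; the list can be handed to subprocess.run().
    """
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",    # hypothetical name for the new model
        f"START_MODEL={start_model}",  # existing model to fine-tune from
        f"PSM={psm}",                  # 8 = single word, 13 = raw line
    ]


if __name__ == "__main__":
    # Print the command for a word-level run so it can be inspected first.
    print(shlex.join(tesstrain_command("word-finetune", "eng", 8)))
```

Whether PSM=8 is actually the right choice for single-word pairs is exactly 
what I am unsure about.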

I have looked for example ground truths on github and found several. Most 
of the training examples are full lines but I've seen the occasional 
single-word training pair.

The reason I want to use single words is that I have built a curation 
interface for fixing Tesseract errors after OCR, and the interface operates 
at the word level. So I am generating a word image / correct text pair for 
every word that Tesseract gets wrong, and I want to feed this data back 
into fine-tuning Tesseract in batches, as a kind of feedback loop (loosely 
like batch reinforcement learning).
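
To make the data format concrete, here is a minimal sketch of how the pairs 
could be written out. It assumes the tesstrain ground-truth layout, where 
each image sits next to a .gt.txt file with the same base name. The function 
name export_word_pairs, the word_NNNNNN file naming and the shape of the 
input (a list of image path / corrected text tuples) are my own assumptions, 
not something tesstrain prescribes.

```python
import shutil
from pathlib import Path


def export_word_pairs(pairs, out_dir):
    """Copy word images into out_dir and write a matching .gt.txt for each.

    pairs: iterable of (image_path, corrected_text) tuples, i.e. the
    hypothetical output of the curation interface.
    Returns a list of (image_file, ground_truth_file) paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (image_path, text) in enumerate(pairs):
        image_path = Path(image_path)
        stem = f"word_{index:06d}"
        # Keep the original image extension (e.g. .png or .tif).
        target_image = out_dir / f"{stem}{image_path.suffix}"
        shutil.copyfile(image_path, target_image)
        # The transcription goes on a single line in <stem>.gt.txt.
        gt_file = out_dir / f"{stem}.gt.txt"
        gt_file.write_text(text.strip() + "\n", encoding="utf-8")
        written.append((target_image, gt_file))
    return written
```

The output directory would then be used as the ground-truth directory for a 
tesstrain run, if single-word pairs turn out to be acceptable input.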

Thanks so much.
