I've managed to improve tesseract results on some real-life documents by 
running "tesseract ... batch.nochop makebox" and correcting the resulting box 
file (in addition to adding spaces and EOLs).
I do have some questions about the correct syntax for the box file.

1) If some of the characters in the tiff image are not represented in the 
box file, will that harm the training (in the sense that it will train 
tesseract to ignore those characters)? Do I have to "get them all"?
2) How important are the coordinates of each character? Should I invest 
time in making sure they are exact? I understand LSTM works as a 
"line recogniser"; how does that affect the training?
3) makebox generates a "~" character for lines in the document. Will 
fixing those before training help tesseract detect them better?
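
For context on question 3, here is a minimal sketch of how the "~" entries 
could be located in a box file before correcting them by hand. It assumes the 
usual box file layout of one entry per line (symbol, left, bottom, right, top, 
page, separated by single spaces); the function names parse_box_line and 
find_placeholders are made up for illustration and are not part of tesseract.

```python
def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page).

    Assumes the layout "<symbol> <left> <bottom> <right> <top> <page>".
    """
    # Split from the right so a symbol that is itself a space or tab survives.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a box-file line: %r" % line)
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, left, bottom, right, top, page


def find_placeholders(lines, placeholder="~"):
    """Return (line_number, entry) for every entry whose symbol is the placeholder."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip completely empty lines
        entry = parse_box_line(line)
        if entry[0] == placeholder:
            hits.append((number, entry))
    return hits
```

Passing the lines of an open box file to find_placeholders gives the line 
numbers that still need a manual fix.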

Thanks

