I've managed to improve Tesseract results on some real-life documents by running "tesseract ... batch.nochop makebox" and correcting the resulting box file (in addition to adding spaces and EOLs). I have some questions about the correct syntax for the box file.
1) If some of the characters in the TIFF image are not represented in the box file, will it harm the training (in the sense that it will train Tesseract to ignore those characters)? Do I have to "get them all"?
2) How important are the coordinates of each character? Should I invest time in making sure they are exact? I understand that LSTM works as a "line recogniser"; how does that affect the training?
3) makebox generates a "~" character for lines in the document. Will fixing those before training help Tesseract detect them better?

Thanks