To preface this: I haven't actually done an OCR run yet (the source image editing is almost done, so by the time any replies come in I probably will have).
I'm working with some photo-scanned images of knitting-related material, so they contain non-word characters and abbreviations. Most of the text is still English, but there are occasional symbols: some are standard ASCII or Unicode, others are specialty symbols. I hope to be able to exclude the specialty symbols from recognition and keep them as images.

As I understand it, Tesseract's recognition works on groups of words, so it sounds like this material might produce unexpected results. An example of a line that could cause a problem:

K2, yo, k2tog, k to last 4, ssk, yo, k2

That doesn't look like English words. It only looks like a sentence *if* you assume that a space or comma marks whatever came before it as a word.

So, to handle and fix things like that without training, I'm looking for tips on how to inspect my hOCR files to verify and, if necessary, correct the results, using tools that work on Linux without running Wine. I am looking into the tools suggested in the "Post OCR Verification and Editing" conversation, but that poster is on Windows with a different toolchain, so I'm not sure they all apply to me.
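
For what it's worth, here is a rough sketch of the kind of check I have in mind for the hOCR files: list every recognized word whose confidence is below some cutoff, so I know where to look first. It assumes the hOCR marks words as `ocrx_word` spans with an `x_wconf` value in the `title` attribute, which is my understanding of what Tesseract writes. The class name, function name, sample markup, and the threshold of 80 are all made up for illustration, and it only uses the Python standard library.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, confidence and bounding box of each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, or None
        self._depth = 0       # nesting depth of tags inside that word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            # Tag nested inside a word, e.g. <strong> or <em>.
            self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            conf = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", title)
            bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
            self._current = {
                "text": "",
                "conf": float(conf.group(1)) if conf else None,
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
            }
            self._depth = 0

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word span itself.
        self.words.append(self._current)
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def low_confidence_words(hocr_text, threshold=80):
    """Return the words whose x_wconf is below the (arbitrary) threshold."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return [w for w in parser.words
            if w["conf"] is not None and w["conf"] < threshold]


# Hand-written sample in the shape I expect from Tesseract, not real output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 200'>
<span class='ocr_line' title='bbox 10 10 400 40'>
<span class='ocrx_word' title='bbox 10 10 40 40; x_wconf 95'>K2,</span>
<span class='ocrx_word' title='bbox 50 10 80 40; x_wconf 62'>yo,</span>
<span class='ocrx_word' title='bbox 90 10 170 40; x_wconf 41'>k2tog,</span>
</span></div>"""

if __name__ == "__main__":
    for word in low_confidence_words(SAMPLE):
        print(word["conf"], word["bbox"], word["text"])
```

To run it on a real file, the idea would be to read the `.hocr` file into a string and pass that to `low_confidence_words` instead of `SAMPLE`. The bounding box is kept so a flagged word can be located in the source image. This only finds words Tesseract itself was unsure about; it won't catch confident misreads, which is why I'm still after a proper viewer/editor.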