In your scenario, I would check the performance of both the modern LSTM engine (v4/v5) and the old "classic" v3 OCR engine in tesseract, just for completeness' sake. The first tests would be separate runs, so I'd be able to compare the output quality of both in hOCR format. (Two separate runs, so I don't have to bother with tesseract's internal heuristic to "pick the best one" and only dump that one: if I were you, I'd want to see both engines' performance and decide what to do after that.)
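A rough sketch of what those two separate runs could look like, driven from Python via the tesseract command line. It assumes `tesseract` is on your PATH and that your `eng.traineddata` still contains the legacy model (as far as I know, the `tessdata_fast` and `tessdata_best` sets are LSTM-only, so `--oem 0` fails with those). The function names, the `ocr_out` directory, and the confidence threshold of 80 are made up for illustration; the word parser assumes tesseract's default hOCR output, where each word is an `ocrx_word` span carrying an `x_wconf` value.

```python
import re
import subprocess
from html.parser import HTMLParser
from pathlib import Path

# Tesseract --oem values: 0 = legacy ("classic") engine only, 1 = LSTM engine only.
ENGINES = {"legacy": 0, "lstm": 1}


def build_command(image, out_base, oem, lang="eng"):
    """Return the tesseract call that writes <out_base>.hocr for one engine."""
    return ["tesseract", str(image), str(out_base),
            "--oem", str(oem), "-l", lang, "hocr"]


def run_both(image, out_dir="ocr_out", lang="eng"):
    """Run each engine separately; return {engine name: path of its hOCR file}."""
    image = Path(image)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, oem in ENGINES.items():
        out_base = out_dir / f"{image.stem}.{name}"
        subprocess.run(build_command(image, out_base, oem, lang), check=True)
        results[name] = Path(f"{out_base}.hocr")  # tesseract appends ".hocr"
    return results


class WordConfidences(HTMLParser):
    """Collect (text, confidence) for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._conf = None  # confidence of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"x_wconf\s+(\d+)", attrs.get("title") or "")
            self._conf = int(match.group(1)) if match else -1  # -1 = not reported
            self._text = []

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._conf is not None:
            self.words.append(("".join(self._text).strip(), self._conf))
            self._conf = None


def low_confidence(hocr_text, threshold=80):
    """Return the (word, confidence) pairs scoring below the chosen threshold."""
    parser = WordConfidences()
    parser.feed(hocr_text)
    parser.close()
    return [(word, conf) for word, conf in parser.words if conf < threshold]
```

Calling `run_both("page.png")` and then `low_confidence(path.read_text(encoding="utf-8"))` on each result gives you one list of suspect words per engine, which is a quick first way to see which engine copes better with lines like "K2, yo, k2tog".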
Postprocessing is akin to "fixing it in the mix": you only do that when all other options have been exhausted.

On Sun, 24 Mar 2024, 19:29 Misti Hamon, <mistiha...@gmail.com> wrote:

> I'm going to preface this with, I haven't actually done an OCR run yet (by
> the time any replies come in, I probably will have, the source image
> editing is almost done).
>
> I'm working with some photoscanned images of knitting related work (so,
> there are some non-word characters and acronyms used, most are still
> English but there are occasional symbols, some standard ascii or unicode,
> others specialty - I should be able to exclude the specialty symbols and
> keep them as an image, or at least I hope so), based on tesseract being a
> "groups of words" based recognition, it sounds like this might produce
> unexpected results? (example of a line that might show up that could
> cause a problem would be - K2, yo, k2tog, k to last 4, ssk, yo, k2 -
> doesn't look like English words, kind of looks like a sentence *if* you
> assume a space or comma denotes a that which came before is a word)
>
> So, in order to handle/fix stuff like that, without training, I'm looking
> for tips on how to inspect my hOCR files to verify and, if necessary,
> correct the results, that work on linux without running wine. I am looking
> into the tools suggested in the "Post OCR Verification and Editing"
> conversation, but that poster is on windows, with a different toolchain,
> so, not sure all apply to me.