Re: [tesseract-ocr] hOCR verification and editing plus non-word characters

Ger Hobbelt Mon, 25 Mar 2024 04:50:28 -0700

In your scenario, I would check the performance of both the modern LSTM
engine (v4/v5) and the old "classic" v3 OCR engine in tesseract, just for
completeness' sake. The first tests would be separate runs, so that I could
check the output quality of both in hOCR format. (Two separate runs, so I
don't have to bother with tesseract's internal heuristic that "picks the
best one" and dumps only that one. If I were you, I'd want to see both
engines' performance and decide what to do after that.)
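
A minimal sketch of two such separate runs, driven from Python. It assumes
the tesseract binary is on the PATH and that the installed eng.traineddata
still contains the legacy model (the files from the "tessdata" repository
do; "tessdata_fast" and "tessdata_best" are LSTM-only, so --oem 0 fails
with those). The function names, the output directory and the file naming
are illustrative assumptions.

```python
import subprocess
from pathlib import Path

# Tesseract's --oem values: 0 = legacy ("classic") engine only, 1 = LSTM only.
ENGINES = {"legacy": 0, "lstm": 1}


def build_commands(image, out_dir="ocr_out", lang="eng"):
    """Return one tesseract command per engine, each with its own output base."""
    stem = Path(image).stem
    commands = {}
    for name, oem in ENGINES.items():
        # tesseract appends ".hocr" to the output base by itself
        out_base = str(Path(out_dir) / f"{stem}-{name}")
        commands[name] = ["tesseract", str(image), out_base,
                          "--oem", str(oem), "-l", lang, "hocr"]
    return commands


def run_both(image, out_dir="ocr_out", lang="eng"):
    """Run both engines one after the other; raises if a run fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(image, out_dir, lang).values():
        subprocess.run(cmd, check=True)
```

Calling run_both("page01.png") would leave ocr_out/page01-legacy.hocr and
ocr_out/page01-lstm.hocr side by side for comparison.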


Postprocessing is akin to "fixing it in the mix": you only do that when all
other options have been depleted.
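
Before any postprocessing, one way to compare two hOCR outputs is to list
the words each engine was unsure about. The sketch below reads the x_wconf
word confidence that tesseract writes into the title attribute of each
ocrx_word element. The helper name low_confidence_words and the threshold
of 80 are illustrative assumptions, not anything defined by tesseract.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, confidence) for every element with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0      # > 0 while inside an ocrx_word element
        self._conf = -1.0
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth:      # nested tag inside a word (e.g. <strong>)
            self._depth += 1
            return
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or "").split():
            m = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", a.get("title") or "")
            self._conf = float(m.group(1)) if m else -1.0
            self._text = []
            self._depth = 1

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:   # the word element itself just closed
                self.words.append(("".join(self._text).strip(), self._conf))


def low_confidence_words(hocr_text, threshold=80):
    """Return the (word, confidence) pairs whose confidence is below threshold."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return [(w, c) for w, c in parser.words if c < threshold]
```

Running this over the hOCR file from each engine and comparing the two
lists shows where the engines disagree about how reliable a token is.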


On Sun, 24 Mar 2024, 19:29 Misti Hamon, <mistiha...@gmail.com> wrote:

> I'm going to preface this with, I haven't actually done an OCR run yet (by
> the time any replies come in, I probably will have, the source image
> editing is almost done).
>
> I'm working with some photoscanned images of knitting related work (so,
> there are some non-word characters and acronyms used, most are still
> English but there are occasional symbols, some standard ascii or unicode,
> others specialty - I should be able to exclude the specialty symbols and
> keep them as an image, or at least I hope so), based on tesseract being a
> "groups of words" based recognition, it sounds like this might produce
> unexpected results?   (example of a line that might show up that could
> cause a problem would be - K2, yo, k2tog, k to last 4, ssk, yo, k2 -
> doesn't look like English words, kind of looks like a sentence *if* you
> assume a space or comma denotes that what came before it is a word)
>
> So, in order to handle/fix stuff like that, without training, I'm looking
> for tips on how to inspect my hOCR files to verify and, if necessary,
> correct the results, that work on linux without running wine. I am looking
> into the tools suggested in the "Post OCR Verification and Editing"
> conversation, but that poster is on windows, with a different toolchain,
> so, not sure all apply to me.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/6a64a68e-c3d5-4878-8c74-37be419c54d8n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/6a64a68e-c3d5-4878-8c74-37be419c54d8n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

