[tesseract-ocr] hOCR verification and editing plus non-word characters

Misti Hamon Sun, 24 Mar 2024 11:29:07 -0700

To preface this: I haven't actually done an OCR run yet. By the time any 
replies come in I probably will have, since the source image editing is 
almost done.


I'm working with some photo-scanned images of knitting-related work, so 
there are some non-word characters and acronyms. Most of the text is still 
English, but there are occasional symbols: some standard ASCII or Unicode, 
others specialty. (I should be able to exclude the specialty symbols and 
keep them as images, or at least I hope so.) Since Tesseract's recognition 
is based on "groups of words", it sounds like this might produce 
unexpected results?

An example of a line that could cause a problem:

    K2, yo, k2tog, k to last 4, ssk, yo, k2

It doesn't look like English words, but it kind of looks like a sentence 
*if* you assume a space or comma denotes that what came before it is a 
word.

So, in order to handle/fix stuff like that without training, I'm looking 
for tips on tools that run on Linux (without Wine) for inspecting my hOCR 
files, so I can verify and, if necessary, correct the results. I am looking 
into the tools suggested in the "Post OCR Verification and Editing" 
conversation, but that poster is on Windows with a different toolchain, 
so I'm not sure they all apply to me.
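
For the verification step, here is a minimal sketch of the kind of check 
that could flag words for manual review. It assumes the hOCR comes from 
Tesseract's hocr output (for example `tesseract page.png page hocr`) and 
that each word is a span with class `ocrx_word` whose title attribute 
carries `bbox` and `x_wconf`. The names `parse_hocr_words` and 
`low_confidence_words`, the threshold of 80, and the file names are 
illustrative only, not part of any existing tool.

```python
import sys
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, confidence and bounding box of each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []        # finished words, in document order
        self._current = None   # word currently being read, or None
        self._depth = 0        # span nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            # A span nested inside a word: just track the depth.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" not in (attrs.get("class") or "").split():
            return
        self._current = {"text": "", "conf": None, "bbox": None}
        self._depth = 1
        # The title looks like "bbox 35 2 60 18; x_wconf 41".
        for prop in (attrs.get("title") or "").split(";"):
            parts = prop.split()
            if not parts:
                continue
            if parts[0] == "x_wconf" and len(parts) >= 2:
                self._current["conf"] = float(parts[1])
            elif parts[0] == "bbox" and len(parts) >= 5:
                self._current["bbox"] = tuple(int(p) for p in parts[1:5])

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_hocr_words(hocr_text):
    """Return a list of word dicts found in an hOCR document string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def low_confidence_words(hocr_text, threshold=80.0):
    """Return words whose confidence is missing or below the threshold."""
    return [
        w for w in parse_hocr_words(hocr_text)
        if w["conf"] is None or w["conf"] < threshold
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as fh:
        for word in low_confidence_words(fh.read()):
            print(word["conf"], word["bbox"], repr(word["text"]))
```

Saved under a hypothetical name such as check_hocr.py, it would be run as 
`python3 check_hocr.py page.hocr` and print each flagged word with its 
confidence and bounding box, so the word can be located in the scan. It 
only reports; it does not edit the hOCR file.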

