[tesseract-ocr] Textbook-like format. Correcting improperly recognized text

Misti Hamon Mon, 29 Apr 2024 11:05:47 -0700

Forgive me, I have lots of questions and will be trying to separate out one 
question per conversation (so that those searching later may more easily 
find the answers).


I'm working with scanned images of a textbook-like layout: occasional 
drop caps, and text in 2 or occasionally 3 columns that flows around images 
(sometimes an actual square or rectangle; in others the image had its 
background removed and the text flows around the subject). There is also 
jargon (most of the book is English, but there is topic-specific jargon, 
abbreviations of the jargon, and, even worse, acronyms and symbols of said 
jargon). Where fractions are used, they are in the form of smart fractions 
(so something like 1/4" uses the space of 2 characters, not 4). Also, the 
lighting during the scan was uneven and the original images were taken at 
approx. 250 dpi. There is also tabular data (worst case, I'm fine with the 
tabular stuff not being included in the OCR results).

I've preprocessed the images, including binarization and upscaling to get 
300 dpi for tesseract to work with, but the uneven lighting couldn't be 
entirely fixed (short of a rescan, which is not an option right now, unless 
someone knows of a way to fix it in GIMP), which made binarization of some 
blocks on some pages less successful than others.
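
For readers hitting the same uneven-lighting problem: one common approach is 
local (adaptive) thresholding, where each pixel is compared with the mean of 
its own neighbourhood instead of one page-wide cutoff. The sketch below is a 
dependency-free illustration only, not what was used on these scans. The 
window size, the offset, and the synthetic test "page" are all assumptions 
chosen for the demo; a real pipeline would normally use an image library.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a grayscale image (list of rows of 0-255 ints) by local mean.

    A pixel becomes white (255) if it is brighter than the mean of its
    window x window neighbourhood minus `offset`, otherwise black (0).
    `window` and `offset` are illustrative defaults, not tuned values.
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border: integral[y][x] = sum of gray[:y][:x].
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            out[y][x] = 255 if gray[y][x] > mean - offset else 0
    return out


# Synthetic "page" (made up for this demo): the background fades from 200 on
# the left to 83 on the right, with three dark one-pixel strokes as "text".
STROKES = (5, 20, 35)
page = [[(200 - 3 * x) - (60 if x in STROKES else 0) for x in range(40)]
        for _ in range(20)]

binary = adaptive_threshold(page)
# A single global cutoff, for comparison: the dim right side turns black.
global_binary = [[255 if v > 128 else 0 for v in row] for row in page]
```

On this synthetic input the global cutoff turns the darker right-hand 
background black, while the local version keeps only the strokes black. Real 
scans are noisier, so the window and offset would need tuning per book.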

That's the background; I may need to refer back to it in other questions.

So far (I've tried OEM 0 and 1) results are "ok" but there are errors: 
both high-confidence words that are wrong and low-confidence words that 
are actually correct, as well as difficulty with the fractions and orphans 
from the drop caps. Some of the jargon-related stuff is iffy too (when 
lighting and binarization are clear, LSTM runs pick most of it up pretty 
well, though).

Using an hOCR viewer (ScribeOCR, which I found out about on this list) 
isn't going so well. The physical book these images were taken from is 
approximately US Letter sized, and ScribeOCR is "stuck" on showing me the 
whole page, which makes the text too small to actually read. Since I have 
wrong high-confidence and correct low-confidence words, I can't depend on 
the color coding; if I could read the text, I could correct it there. So 
how, exactly, does one go about correcting hOCR results?
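
To make the question concrete: hOCR is XHTML in which each recognized word 
is a span of class ocrx_word, with its bounding box and confidence (x_wconf) 
stored in the title attribute. One possible scripted workflow, sketched 
below with only the Python standard library, is to list every word with its 
id, confidence, and box, then write corrections back by word id. The sample 
markup, the misread "Y4", and the helper names (parse_title, list_words, 
apply_corrections) are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment in the shape Tesseract emits; "Y4" stands in
# for a misread smart fraction. Real files wrap this in <html>/<body>.
SAMPLE_HOCR = """<div class='ocr_page' id='page_1' title='bbox 0 0 2550 3300'>
 <span class='ocr_line' id='line_1_1' title='bbox 100 200 900 260'>
  <span class='ocrx_word' id='word_1_1' title='bbox 100 200 260 260; x_wconf 96'>Add</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 280 200 330 260; x_wconf 41'>Y4</span>
  <span class='ocrx_word' id='word_1_3' title='bbox 350 200 480 260; x_wconf 93'>cup</span>
 </span>
</div>"""


def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


def list_words(root):
    """Return id, text, confidence, and bbox for every ocrx_word element."""
    words = []
    for el in root.iter():
        if el.get("class") != "ocrx_word":
            continue
        props = parse_title(el.get("title", ""))
        words.append({
            "id": el.get("id"),
            # itertext() also collects text nested in <strong>/<em> children.
            "text": "".join(el.itertext()).strip(),
            "conf": float(props.get("x_wconf", ["0"])[0]),
            "bbox": tuple(int(v) for v in props.get("bbox", [])),
        })
    return words


def apply_corrections(root, corrections):
    """Replace the text of words whose id is a key in `corrections`."""
    changed = 0
    for el in root.iter():
        if el.get("class") == "ocrx_word" and el.get("id") in corrections:
            for child in list(el):  # drop nested markup, keep plain text
                el.remove(child)
            el.text = corrections[el.get("id")]
            changed += 1
    return changed


root = ET.fromstring(SAMPLE_HOCR)
for word in list_words(root):
    print(word["id"], word["conf"], word["bbox"], word["text"])

n = apply_corrections(root, {"word_1_2": "¼"})
corrected = ET.tostring(root, encoding="unicode")
```

Because the listing shows every word regardless of confidence, it does not 
rely on color coding, and the bbox values can be used to crop the matching 
region of the page image for a readable side-by-side check.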

