Regarding proofreading with Scribe OCR <https://scribeocr.com/>, it is 
definitely possible to zoom in.  The controls are virtually identical to 
those of popular document viewers like Acrobat.  You can zoom in on the 
current location of the mouse using Control + Mouse Wheel, scroll using the 
mouse wheel, and pan in all directions using the middle mouse button.

Regarding confidence metrics: unfortunately, the confidence metrics reported 
by Tesseract are extremely unreliable at the level of individual words. This 
is not really fixable, and it is not even unique to Tesseract. I 
benchmarked Abbyy (paid/commercial OCR program) at one point and found that 
the vast majority of low-confidence words were correct, and the vast 
majority of incorrect words were high-confidence.  Metrics from OCR engines 
can be useful at a less granular level (a page with an average confidence of 
0.95 will generally be significantly higher quality than a page with an 
average confidence of 0.80); however, I don't think accurate metrics are 
possible at the word level. None of these programs have any robust way to 
evaluate themselves, 
so the confidence metrics are built using some internal metrics from the 
recognition process.
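
As an illustration of the page-level use of these metrics, you can estimate 
a page's average confidence from Tesseract's hOCR output by averaging the 
per-word `x_wconf` values (a minimal sketch; it assumes standard hOCR markup 
with `ocrx_word` spans and uses a toy two-word page):

```python
import re

def page_confidence(hocr: str) -> float:
    """Average the per-word x_wconf values in a Tesseract hOCR page."""
    confs = [int(m) for m in re.findall(r"x_wconf (\d+)", hocr)]
    return sum(confs) / len(confs) if confs else 0.0

# Toy hOCR fragment (real pages contain full ocr_page/ocr_line markup).
hocr = '''<span class="ocrx_word" title="bbox 0 0 5 5; x_wconf 95">the</span>
<span class="ocrx_word" title="bbox 6 0 9 5; x_wconf 80">cat</span>'''

print(page_confidence(hocr))  # averages 95 and 80 -> 87.5
```

A regex is good enough here because `x_wconf` only ever appears inside the 
`title` attribute of word spans; a stricter tool could parse the HTML properly.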

If having more accurate confidence metrics is important, one option is to 
use the built-in "Recognize Text" feature of Scribe OCR rather than 
uploading data from Tesseract.  This feature runs Tesseract Legacy and 
Tesseract LSTM, compares the results, and marks words that agree across 
versions as "high confidence" and words that disagree across versions as 
"low confidence."  This method is significantly more robust than using the 
confidence metrics from Tesseract, and generally flags >90% of incorrect 
text as low confidence.  Note that Scribe OCR uses (by default) a forked 
version of Tesseract, so recognition results may differ.
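
The idea behind that feature can be sketched as a word-level comparison of 
the two engines' outputs, flagging any word the engines disagree on as low 
confidence. This is only an illustration of the approach, not Scribe OCR's 
actual implementation (which also has to align words by position on the page):

```python
from difflib import SequenceMatcher

def flag_disagreements(legacy_words, lstm_words):
    """Pair each LSTM word with True (high confidence) only if the
    Legacy output produced the same word at an aligned position."""
    flags = [False] * len(lstm_words)  # False = low confidence
    sm = SequenceMatcher(a=legacy_words, b=lstm_words, autojunk=False)
    for op, _, _, b1, b2 in sm.get_opcodes():
        if op == "equal":
            for i in range(b1, b2):
                flags[i] = True  # both engines agree on this word
    return list(zip(lstm_words, flags))

legacy = ["The", "quick", "brovvn", "fox"]
lstm = ["The", "quick", "brown", "fox"]
print(flag_disagreements(legacy, lstm))
# "brown" disagrees across engines, so it is flagged low confidence
```

The appeal of this method is that the two engines have genuinely different 
architectures, so their errors are largely uncorrelated; agreement is strong 
evidence the word is right.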

Answering questions specific to your document would require providing some 
of the image(s) at issue. 
On Monday, April 29, 2024 at 11:05:43 AM UTC-7 misti...@gmail.com wrote:

> Forgive me, I have lots of questions and will be trying to separate out 
> one question per conversation (so that those searching later may more 
> easily find the answers).
>
> I'm working with scanned images of a textbook-like layout - occasional 
> drop caps, text in 2 or occasionally 3 columns that flows around images 
> (sometimes an actual square or rectangle; in others the image had the 
> background removed and the text flows around the subject) and jargon (most 
> of the book is English, but there is topic-specific jargon, abbreviations 
> of the jargon, and, even worse, acronyms and symbols of said jargon). Where 
> fractions are used, they are in the form of smart fractions (so something 
> like 1/4" uses the space of 2 characters, not 4). Also, the lighting during 
> the scan was uneven and the original images were taken at approx 250 dpi. 
> There is also tabular data (worst case, I'm fine with the tabular stuff not 
> being included in the ocr results).
>
> I've preprocessed the images, including binarization and upscaling to get 
> 300 dpi for Tesseract to work with, but the uneven lighting couldn't be 
> entirely fixed (I would need to rescan unless someone knows of a way to fix 
> it in GIMP, and that is not an option right now), which made binarization 
> of some blocks on some pages less successful than others.
>
> That's the background, may need to refer back to it with other questions.
>
> So far (I've tried OEM 0 and 1) results are "ok" but there are errors - 
> both high confidence words that are wrong, and low confidence words that 
> are actually correct, as well as difficulty with the fractions and orphans 
> from the drop caps. Some of the jargon-related stuff is iffy too (when 
> lighting and binarization are clean, LSTM runs pick most of it up pretty 
> well, though). Using an hOCR viewer - Scribe OCR, which I found out about on 
> list - isn't going so well: the physical book these images were taken from 
> is approximately US Letter sized, and Scribe OCR is "stuck" on showing me 
> the whole page, which makes the text too small to actually read (and since 
> I have wrong high-confidence and correct low-confidence words, I can't 
> depend on the color coding) - if I could read it, I could correct it there. 
> So, how, exactly, does one go about correcting hOCR results?
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.