I updated the desktop version of Scribe OCR <https://scribeocr.com/> to add zoom buttons, and removed the behavior where changing pages resets the zoom. It should therefore be possible to edit all pages after zooming in once, which should make this less of an issue.
I am not aware of any post-processing program that would fix the issues described without using OCR or manual review.

On Monday, April 29, 2024 at 1:03:16 PM UTC-7 misti...@gmail.com wrote:

> "Regarding proofreading with Scribe OCR, it is definitely possible to zoom in. The controls are virtually identical to popular document viewer programs like Acrobat. You can zoom in on the current location of the mouse using Control + Mouse Wheel, scroll using the mouse wheel, and pan in all directions using the middle mouse button."
>
> This was helpful, sort of. I'm on a laptop with a gesture-capable TouchPad and gesture-capable touch screen; zooming using gestures did not work (looks more like an OS settings problem I'll have to investigate), but I did pull out an actual mouse and was able to get zoom working that way, so thank you. A request? If possible, could a "fit width" and a "fit page" button be added instead of being dependent on a real mouse to get at least some zoom? (Scroll and pan work fine via TouchPad and touch screen.)
>
> I'll go through all my images and see if I can find a single page that has most of the issues so I'm not sending several; might take a few days. The main crux of my question, though, is: is there a way to post-process "fix" things like missed characters, drop-cap-related orphans, commas that are read as periods regardless of how good your input images are, "smart fractions", and any other problems that can't be fixed by tweaking the command used to invoke tesseract? (Neither legacy nor LSTM does well with the drop caps or smart fractions, so running Scribe OCR's recognize would help those anyway, even if it fixes everything else.) I do have questions about tweaking the command as well, just haven't asked them yet.
>
> On Mon, Apr 29, 2024, 12:36 Jeremiah <jeremia...@gmail.com> wrote:
>
>> Regarding proofreading with Scribe OCR <https://scribeocr.com/>, it is definitely possible to zoom in. The controls are virtually identical to popular document viewer programs like Acrobat. You can zoom in on the current location of the mouse using Control + Mouse Wheel, scroll using the mouse wheel, and pan in all directions using the middle mouse button.
>>
>> Regarding confidence metrics, unfortunately, confidence metrics reported by Tesseract are extremely unreliable on the level of individual words. This is unfortunately not fixable, and is not even unique to Tesseract. I benchmarked Abbyy (paid/commercial OCR program) at one point and found that the vast majority of low-confidence words were correct, and the vast majority of incorrect words were high-confidence. Metrics from OCR engines can be useful on a less granular level--a page with an average confidence of 0.95 will be significantly higher-quality than a page with an average confidence of 0.80--however, I don't think accurate metrics are possible on the word level. None of these programs have any robust way to evaluate themselves, so the confidence metrics are built using some internal metrics from the recognition process.
>>
>> If having more accurate confidence metrics is important, one option is to use the built-in "Recognize Text" feature of Scribe OCR rather than uploading data from Tesseract. This feature runs Tesseract Legacy and Tesseract LSTM, compares the results, and marks words that agree across versions as "high confidence" and words that disagree across versions as "low confidence."
>> This method is significantly more robust than using the confidence metrics from Tesseract, and generally flags >90% of incorrect text as low confidence. Note that Scribe OCR uses (by default) a forked version of Tesseract, so recognition results may differ.
>>
>> Answering questions specific to your document would require providing some of the image(s) at issue.
>>
>> On Monday, April 29, 2024 at 11:05:43 AM UTC-7 misti...@gmail.com wrote:
>>
>>> Forgive me, I have lots of questions and will be trying to separate out one question per conversation (so that those searching later may more easily find the answers).
>>>
>>> I'm working with scanned images of a textbook-like layout - occasional drop caps, text in 2 or occasionally 3 columns that flows around images (sometimes an actual square or rectangle; in others the image had the background removed and the text flows around the subject), and jargon (most of the book is English, but there is topic-specific jargon, abbreviations of the jargon, and, even worse, acronyms and symbols of said jargon); where fractions are used, they are in the form of smart fractions (so something like 1/4" uses the space of 2 characters, not 4). Also, the lighting during the scan was uneven, and the original images were taken at approx 250 dpi. There is also tabular data (worst case, I'm fine with the tabular stuff not being included in the OCR results).
>>>
>>> I've preprocessed the images, including binarization and upscaling to get 300 dpi for tesseract to work with, but the uneven lighting wasn't able to be entirely fixed (would need to rescan unless someone knows of a way to fix it in GIMP, and that is not an option right now), which made binarization of some blocks on some pages less successful than others.
>>>
>>> That's the background; I may need to refer back to it with other questions.
>>>
>>> So far (I've tried OEM 0 and 1) results are "ok", but there are errors - both high-confidence words that are wrong and low-confidence words that are actually correct, as well as difficulty with the fractions and orphans from the drop caps. Some of the jargon-related stuff is iffy too (when lighting and binarization are clear, LSTM runs pick most of it up pretty well, though). Using an hOCR viewer - Scribe OCR, which I found out about on list - isn't going so well; the physical book these images were taken from is approximately US Letter sized, and scribeocr is "stuck" on showing me the whole page, which makes the text too small to actually read (and since I have wrong high-confidence and correct low-confidence words, I can't depend on the color coding) - if I could read it I could correct there. So, how, exactly, does one go about correcting hOCR results?
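
For anyone who wants something like the agreement check outside of Scribe OCR: this is not Scribe OCR's actual implementation, just a rough sketch of the idea using two Tesseract CLI runs (legacy and LSTM) and a naive word-by-word comparison. The file name is a placeholder, --oem 0 needs legacy-compatible traineddata installed, and a real tool would align the two word streams properly instead of zipping them.

import subprocess

def tesseract_words(image_path, oem):
    # Run one Tesseract pass in TSV mode and return its words in reading order.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--oem", str(oem), "tsv"],
        capture_output=True, text=True, check=True,
    )
    words = []
    for row in result.stdout.splitlines()[1:]:   # skip the TSV header row
        cols = row.split("\t")
        if len(cols) == 12 and cols[0] == "5" and cols[11].strip():  # level 5 = word
            words.append(cols[11])
    return words

legacy = tesseract_words("page.png", 0)  # legacy engine
lstm = tesseract_words("page.png", 1)    # LSTM engine

# Words where the two engines disagree get treated as "low confidence".
for i, (a, b) in enumerate(zip(legacy, lstm)):
    if a != b:
        print(f"word {i}: legacy={a!r} vs lstm={b!r} -> review")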
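
And to illustrate the page-level vs. word-level point: the x_wconf values Tesseract writes into hOCR output are not trustworthy per word, but averaging them over a page is a quick way to find the pages most worth proofreading first. A minimal sketch (the .hocr file name is a placeholder):

import re

with open("page.hocr", encoding="utf-8") as f:
    hocr = f.read()

# Tesseract stores per-word confidence as "x_wconf NN" in each ocrx_word span.
confs = [int(c) for c in re.findall(r"x_wconf (\d+)", hocr)]
if confs:
    print(f"{len(confs)} words, page average confidence {sum(confs) / len(confs):.1f}")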
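
On the uneven lighting: I can't promise it will rescue these particular scans, but locally adaptive thresholding (e.g. OpenCV's adaptiveThreshold) often binarizes unevenly lit pages better than a single global threshold, and it can be scripted rather than done page by page in GIMP. The file names and parameter values below are illustrative and would need tuning:

import cv2

gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Threshold each pixel against a Gaussian-weighted mean of its neighborhood,
# so shadowed and brightly lit regions are binarized relative to local lighting.
binary = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    51,   # blockSize: neighborhood size in pixels (must be odd)
    15,   # C: constant subtracted from the local mean
)
cv2.imwrite("scan_bin.png", binary)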