[tesseract-ocr] Using corrected text in second pass

Graham Seaman Thu, 18 Feb 2021 12:07:53 -0800

I'm newish to tesseract so this may be a FAQ (though I've looked and it's 
not in the actual FAQs!) - please point me to the right place if it is.


My use case:

There are lots of PDFs of scanned books around which include moderately 
good OCR text (e.g. on archive.org). There are also lots of epub, text or 
html books which have been created from this OCR output and then manually 
corrected (e.g. gutenberg.org). There is no feedback loop between the two - 
the manually corrected text is never used to improve the text embedded in 
the PDF. The same applies if I scan books myself and manually correct the 
extracted OCR text - there is no way I know of to generate a PDF with fully 
correct embedded text using my manual corrections.

One way to fix this might be if tesseract could take a manually corrected 
text as a kind of 'hint' file along with the original scanned pages, and 
then do a second pass to generate the final PDF, with fully correct 
embedded text. Obviously there could be problems keeping the scan 
processing and the hint text in sync, but generally this sounds to me like 
it should be doable. Would it be? Or is there an existing way to solve the 
same problem? (preferably without editing hOCR files by hand!)
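
To make the idea concrete, here is a rough sketch of the "keep in sync" 
step only, not anything tesseract does today. It assumes the first pass 
has already given me a list of (word, bounding box) pairs (e.g. parsed from 
tesseract's TSV or hOCR output) and that the corrected text is plain text. 
The function name and data layout are my own invention for illustration. 
It uses Python's difflib to line up the two word sequences and swaps in the 
corrected word wherever the substitution is one-to-one, keeping the 
original box.

```python
from difflib import SequenceMatcher


def align_corrections(ocr_words, corrected_text):
    """Merge corrected words into OCR word boxes.

    ocr_words: list of (text, bbox) pairs from the first OCR pass
               (hypothetical layout; bbox can be any value).
    corrected_text: the manually corrected plain text.
    Returns a new list of (text, bbox) pairs.
    """
    ocr_tokens = [text for text, _ in ocr_words]
    fixed_tokens = corrected_text.split()
    matcher = SequenceMatcher(a=ocr_tokens, b=fixed_tokens, autojunk=False)

    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: treat it as a
            # word-for-word correction and keep each original box.
            for offset in range(i2 - i1):
                box = ocr_words[i1 + offset][1]
                merged.append((fixed_tokens[j1 + offset], box))
        else:
            # Equal runs are copied as-is. Insertions, deletions and
            # uneven replacements are ambiguous (no box to attach a new
            # word to), so this sketch leaves the OCR words untouched.
            merged.extend(ocr_words[i1:i2])
    return merged


if __name__ == "__main__":
    page = [
        ("Tlie", (10, 10, 50, 30)),
        ("quick", (60, 10, 120, 30)),
        ("brovvn", (130, 10, 200, 30)),
        ("fox", (210, 10, 250, 30)),
    ]
    for word, box in align_corrections(page, "The quick brown fox"):
        print(word, box)
```

Split or merged words (where one OCR word corresponds to two corrected 
words, or the reverse) are exactly the sync problem mentioned above, and 
this sketch simply skips them. Writing the merged words back into a PDF 
text layer would be a separate step.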

Graham

