I'm newish to Tesseract, so this may be a FAQ (though I've looked and it's not in the actual FAQs!). Please point me to the right place if it is.
My use case: there are lots of PDFs of scanned books around which include moderately good OCR-ed text (e.g. on archive.org). There are also lots of EPUB, text or HTML books which have been created from this OCR output and then manually corrected (e.g. gutenberg.org). There is no feedback loop between the two: the manually corrected text is never used to improve the text embedded in the PDF.

This also applies if I scan books myself and manually correct the extracted OCR text. There is no way I know of to generate a PDF with fully correct embedded text using my manual corrections.

One way to fix this might be if Tesseract could take a manually corrected text as a kind of 'hint' file along with the original scanned pages, and then do a second pass to generate the final PDF, with fully correct embedded text. Obviously there could be problems keeping the scan processing and the hint text in sync, but generally this sounds to me like it should be doable.

Would it be? Or is there an existing way to solve the same problem (preferably one that doesn't involve editing hOCR files by hand)?

Graham
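
To make the 'keeping in sync' part concrete, here is a minimal sketch of the kind of alignment I have in mind. It is not an existing Tesseract feature, and the function name `apply_hints` and the `(text, box)` tuple layout are made up for illustration. It assumes the OCR words and their bounding boxes for a page are already available (however they are obtained) and uses Python's standard `difflib` to line them up with the corrected text. Writing the result back into a PDF is not shown.

```python
from difflib import SequenceMatcher


def apply_hints(ocr_words, corrected_text):
    """Swap OCR words for corrected ones where the alignment is unambiguous.

    ocr_words: list of (text, box) tuples in reading order (hypothetical layout).
    corrected_text: the manually corrected text for the same page.
    Returns a new list of (text, box) tuples.
    """
    hint_words = corrected_text.split()
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=hint_words, autojunk=False)

    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # OCR already agrees with the corrected text.
            result.extend(ocr_words[i1:i2])
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: keep each box,
            # take the corrected spelling.
            for (_, box), hint in zip(ocr_words[i1:i2], hint_words[j1:j2]):
                result.append((hint, box))
        elif tag in ("replace", "delete"):
            # Word counts differ (split or merged words, stray marks):
            # too ambiguous for this sketch, so keep the OCR output.
            result.extend(ocr_words[i1:i2])
        # "insert": corrected words with no box to attach to are skipped.
    return result


page = [("Tbe", (0, 0, 10, 10)), ("quick", (12, 0, 30, 10)),
        ("hrown", (32, 0, 50, 10)), ("fox", (52, 0, 60, 10))]
fixed = apply_hints(page, "The quick brown fox")
```

In this toy case each misread word picks up the corrected spelling while keeping its original box. The hard cases are the ones the sketch deliberately leaves alone: words the OCR split or merged, and running headers or page numbers that are in the scan but not in the corrected text.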