[tesseract-ocr] Using corrected text in second pass

2021-02-18 Thread Graham Seaman
I'm newish to Tesseract, so this may be a FAQ (though I've looked and it's not in the actual FAQs!) - please point me to the right place if it is. My use case: there are lots of PDFs of scanned books around that include moderately good OCRed text (e.g. on archive.org). There are also lots of epub

Re: [tesseract-ocr] Re: Using corrected text in second pass

2021-02-19 Thread Graham Seaman
Thanks Tom - I probably shouldn't have given the Gutenberg example since it introduces extra problems. In my actual process at the moment I have the source scans, OCR output texts, and corrected text files produced by myself, so there are fewer variables to worry about. In particular, page division