I'm newish to Tesseract so this may be a FAQ (though I've looked and it's
not in the actual FAQs!) - please point me to the right place if it is.
My use case:
There are lots of PDFs of scanned books around which include moderately
good OCRed text (e.g. on archive.org). There are also lots of epub
Thanks Tom - I probably shouldn't have given the Gutenberg example since
it introduces extra problems. In my actual process at the moment I have
the source scans, OCR output texts, and corrected text files produced by
myself, so there are fewer variables to worry about. In particular, page
division