Re: [tesseract-ocr] Re: Using corrected text in second pass

2021-02-21 Thread Tom Morris
For alignment you're probably thinking of Burrows-Wheeler:
https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
There's a more fully worked, and more topical, example in ReTAS:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/
http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id
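
To make the alignment idea concrete, here is a minimal sketch of lining up OCR output against a corrected text at the word level. It is not Burrows-Wheeler and not the ReTAS method: it swaps in Python's standard-library difflib.SequenceMatcher, which is only practical for page-sized inputs. The function names align_words and substitutions are hypothetical, chosen for illustration.

```python
from difflib import SequenceMatcher


def align_words(ocr_text, corrected_text):
    """Align two texts word by word.

    Returns a list of (tag, ocr_words, corrected_words) tuples, where tag is
    one of difflib's opcodes: 'equal', 'replace', 'delete' or 'insert'.
    """
    ocr_words = ocr_text.split()
    corrected_words = corrected_text.split()
    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # items in long sequences, which would otherwise skip common words.
    matcher = SequenceMatcher(a=ocr_words, b=corrected_words, autojunk=False)
    return [
        (tag, ocr_words[i1:i2], corrected_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def substitutions(ocr_text, corrected_text):
    """Return (ocr, corrected) string pairs for spans the correction replaced."""
    return [
        (" ".join(ocr_span), " ".join(corrected_span))
        for tag, ocr_span, corrected_span in align_words(ocr_text, corrected_text)
        if tag == "replace"
    ]


if __name__ == "__main__":
    # Made-up sample strings, not taken from any real scan.
    for ocr, fixed in substitutions("Tbe quick brown fox", "The quick brown fox"):
        print(f"{ocr!r} -> {fixed!r}")
```

Working on whole words keeps the comparison cheap; a character-level pass over each replaced span would be the natural next step for locating individual misread glyphs.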

Re: [tesseract-ocr] Re: Using corrected text in second pass

2021-02-19 Thread Graham Seaman
Thanks Tom - I probably shouldn't have given the Gutenberg example since it introduces extra problems. In my actual process at the moment I have the source scans, OCR output texts, and corrected text files produced by myself, so there are fewer variables to worry about. In particular, page division