For alignment you're probably thinking of
Burrows-Wheeler: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
There's a more fully worked, and more topical, example in ReTAS:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/
http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id
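For anyone unfamiliar with the transform named above, here is a minimal sketch of Burrows-Wheeler in Python. It uses the naive sorted-rotations construction, which is fine for illustration but too slow for whole books. The function names and the "$" sentinel are my own choices for this sketch and are not taken from ReTAS or any other tool. Note that this is only the transform itself; it is not an alignment procedure.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive O(n^2 log n) version)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # Build every rotation of the string, sort them, and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Recover the original text from its transform (naive version)."""
    n = len(last_column)
    table = [""] * n
    # Repeatedly prepend the last column and re-sort; after n rounds the
    # table holds all sorted rotations of the original string.
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The sentinel is assumed to sort before every other character in the text, which holds for "$" against ASCII letters and digits.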
Thanks Tom - I probably shouldn't have given the Gutenberg example since
it introduces extra problems. In my actual process at the moment I have
the source scans, OCR output texts, and corrected text files produced by
myself, so there are fewer variables to worry about. In particular, page
division