For alignment you're probably thinking of the Burrows-Wheeler transform:
https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

There's a more fully worked, and more topical, example in ReTAS:
http://ciir.cs.umass.edu/downloads/ocr-evaluation/
http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=982
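For the easy, strictly linear case, even Python's stdlib difflib gets you a serviceable word-level alignment. This is just a toy illustration (not what ReTAS does), with the fox example from the earlier messages standing in for a real page:

    # Toy illustration only: align OCR tokens against ground-truth tokens
    # and report the edit operations. Real OCR noise needs something
    # sturdier (e.g. character-level edit distance with confusion costs).
    import difflib

    ocr_text   = "the quick brown fox jumped over the lazy dag"
    truth_text = "the quick brown fox jumped over the lazy dog"

    ocr_tokens, truth_tokens = ocr_text.split(), truth_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_tokens, truth_tokens,
                                      autojunk=False)

    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            print(op, ocr_tokens[i1:i2], "->", truth_tokens[j1:j2])
    # prints: replace ['dag'] -> ['dog']

The "replace", "insert" and "delete" opcodes map directly onto the three cases in your message: wrong words, words the OCR missed, and gibberish that needs to go.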
All of that deals with linear texts, though. Once you venture into two-dimensional space and fixing/redoing page segmentation, you're operating in a much more complex domain. You can see some experimentation that I did with the OCR output of the Oxford English Dictionary here:
https://github.com/tfmorris/oed/blob/master/oedabby.py

It was years ago, but, as I remember it, I started down the path of merging/splitting existing bounding boxes and basically ended up deciding that I was going to have to punt and re-segment/lay out the entire page from scratch using character positions (and I never tried it, so it might not have worked).

The hOCR output should (although I haven't looked recently) mirror the page segmentation output, i.e. text blocks in reading order with interspersed graphics blocks for images, etc. That's all fine in the normal case, but if you get lines merged across columns/blocks, or widows/orphans from drop caps or antiquated typesetting conventions, sorting things out is much more difficult. In the easy case, the hOCR output should be trivial to follow/match (I've appended a rough sketch of that patch-the-word-boxes step below the quoted message).

Good luck!
Tom

On Friday, February 19, 2021 at 1:57:28 PM UTC-5 gra...@theseamans.net wrote:

> Thanks Tom - I probably shouldn't have given the Gutenberg example since it introduces extra problems. In my actual process at the moment I have the source scans, OCR output texts, and corrected text files produced by myself, so there are fewer variables to worry about. In particular, page divisions, running headers, etc. can still be there in my corrected text file. Also, since the text comes from the actual PDF there are no problems with variant editions etc.: if the OCR says 'dog' and my text says 'cat', then the OCR is wrong and needs correcting.
>
> So taking Tom's conclusion:
>
> > I don't think it needs to involve Tesseract since you could do it
> > entirely as a post-processing step using the hOCR output and your
> > ground truth text.
>
> Trying to think this through:
>
> I can try to keep track of the current page in both files just by counting, and so assume I'm always working within a page.
>
> For a simple one-column page I guess the process starts with a text-alignment/best-match problem. I have dim memories of there being standard algorithms for this, and with all the gene sequencing stuff now presumably there are lots more, and Python would be a likely bet for cookbook-style examples.
>
> Once I have the best fit for alignment of the two texts, I can both replace incorrect letters in the hOCR and delete unwanted letters and their locations from the hOCR. But the third possibility seems harder: in my experience it is quite common for OCR output to miss out whole words. How would I generate the location information for a word which is missing from the hOCR? Similarly, if there is a poorly scanned bit at the edge of a page where the OCR output is just gibberish, how do I know the locations of the characters to replace them with? Try to interpolate from the positions of surrounding text, I guess, so you would get locations which are actually slightly off (this would not matter at all for searching within the PDF, and maybe not much for copy-and-paste?).
>
> Then what happens with multi-column layout, or text that flows round image boxes? Can I still use my hypothetical text-alignment algorithm? I have no experience with hOCR and don't know how the Tesseract hOCR outputter linearizes these things. Are there fixed rules for how the hOCR data is ordered in the file?
> Are there any helpful texts about hOCR? I found a formal grammar which was no help (to me) at all, but nothing else so far.
>
> Graham
>
> On 19/02/2021 15:44, Tom Morris wrote:
> > On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net wrote:
> >
> > > There are lots of PDFs of scanned books around which include moderately good OCR'ed text (e.g. on archive.org).
> >
> > OCR quality varies widely (even wildly) across scans and vintages of OCR, so it's worth checking your "moderately good" assumption for any edition/scan that you want to work with. Poor quality OCR will make the task impossible.
> >
> > > There are also lots of EPUB, text or HTML books which have been created from this OCR output text, manually corrected (e.g. gutenberg.org).
> >
> > Gutenberg (and pgdp) aren't just "manually corrected" (or at least they didn't use to be), due to Gutenberg's "editionless" policy and specific editorial decisions made by individual pgdp project coordinators. In the same way that OCR noise increases the difficulty of the task, the further the pgdp draft drifts from a 1-to-1 transcription, the harder the alignment task becomes.
> >
> > > There is no feedback loop between the two - the manually corrected text is never used to improve the text embedded in the PDF. This also applies if I scan books myself and manually correct the extracted OCR text - there is no way I know of to generate a PDF with fully correct embedded text using my manual corrections.
> > >
> > > One way to fix this might be if Tesseract could take a manually corrected text as a kind of 'hint' file along with the original scanned pages, and then do a second pass to generate the final PDF version, with fully correct embedded text. Obviously there could be problems around keeping the scan processing and the hint text in sync, but generally this sounds to me like it should be doable. Would it be?
> >
> > Alignment/synchronization is exactly the crux of the problem. The OCR output is text plus bounding box information. In the simple case, with good page segmentation, low OCR error rates, and predictable pgdp editorial decisions (hyphenated words split across line endings closed up, etc.), it's simply a matter of replacing "the quick brown fox jumped over the lazy /dag/" with "the quick brown fox jumped over the lazy *dog*". But what if the ground truth says "the quick brown fox jumped over the lazy cat" or "the quick fox jumped over the dog"? Is that due to us working with a different edition (PG never used to record editions - does it now?) or ...? The easy solution would be to only fix isolated errors with high-confidence replacements, but it's unclear how much that would leave unfixed. That would be an interesting analysis. There are a number of ancillary issues lurking under the covers, like dealing with running headers/footers, signature numbers/marks, etc.
> >
> > I think it would be an interesting project, but it wouldn't be trivial. I don't think it needs to involve Tesseract since you could do it entirely as a post-processing step using the hOCR output and your ground truth text.
> >
> > Tom
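P.S. To make the patch-the-word-boxes step concrete, here's a rough, untested sketch reusing the same difflib alignment as above, this time against the word boxes in Tesseract's hOCR output (class="ocrx_word" spans with the bbox in the title attribute). The file names, the gap-splitting guess for missing words, and working strictly word-by-word are all just illustrative assumptions; a real tool would also have to cope with hyphenation, line/column boundaries, and the segmentation problems discussed above.

    # Untested sketch, not a finished tool. Assumes Tesseract-style hOCR,
    # i.e. <span class="ocrx_word" title="bbox x0 y0 x1 y1 ...">word</span>.
    # File names and the gap-splitting heuristic are invented.
    import difflib
    import re

    from lxml import html  # any (X)HTML parser with XPath would do

    BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

    def hocr_words(tree):
        """Return (span_element, text, [x0, y0, x1, y1]) for every ocrx_word."""
        words = []
        for span in tree.xpath('//span[@class="ocrx_word"]'):
            m = BBOX_RE.search(span.get("title", ""))
            if m:
                words.append((span, span.text_content().strip(),
                              [int(v) for v in m.groups()]))
        return words

    tree = html.parse("page_0042.hocr")                    # hOCR for one page
    truth_tokens = open("page_0042.txt").read().split()    # your corrected text
    ocr = hocr_words(tree)
    ocr_tokens = [text for _, text, _ in ocr]

    matcher = difflib.SequenceMatcher(None, ocr_tokens, truth_tokens,
                                      autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            # Wrong words: overwrite the text but keep the existing boxes
            # (ignores any <strong>/<em> markup nested inside the span).
            for (span, _, _), word in zip(ocr[i1:i2], truth_tokens[j1:j2]):
                span.text = word
        elif op == "delete":
            # Gibberish/spurious words: drop the spans entirely.
            for span, _, _ in ocr[i1:i2]:
                span.getparent().remove(span)
        elif op == "insert":
            # Words the OCR missed: invent boxes by splitting the horizontal
            # gap between the neighbouring OCR words (slightly off, but fine
            # for search and copy-and-paste).
            if i1 == 0 or i1 >= len(ocr):
                continue  # page edge; needs a smarter heuristic
            left_span, _, left_box = ocr[i1 - 1]
            right_box = ocr[i1][2]
            x0, x1 = left_box[2], right_box[0]
            y0, y1 = left_box[1], left_box[3]
            step = max((x1 - x0) // (j2 - j1), 1)
            anchor = left_span
            for k, word in enumerate(truth_tokens[j1:j2]):
                new = html.fromstring(
                    '<span class="ocrx_word" title="bbox %d %d %d %d">%s</span>'
                    % (x0 + k * step, y0, x0 + (k + 1) * step, y1, word))
                anchor.addnext(new)
                anchor = new

    tree.write("page_0042.fixed.hocr")

One convenient property: difflib merges adjacent deletions and insertions into a single "replace" opcode, so the left neighbour used for the interpolation is always a word that is still in the tree.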