Thanks Tom - I probably shouldn't have given the Gutenberg example, since it introduces extra problems. In my actual process at the moment I have the source scans, the OCR output texts, and corrected text files that I produced myself, so there are fewer variables to worry about. In particular, page divisions, running headers, etc. can still be there in my corrected text file. Also, since the text comes from the actual PDF, there are no problems with variant editions: if the OCR says 'dog' and my text says 'cat', then the OCR is wrong and needs correcting.

So, taking Tom's conclusion:

> I don't think it needs to involve Tesseract since you could do it
> entirely as a post-processing step using the hOCR output and your
> ground truth text.

Trying to think this through: I can keep track of the current page in both files just by counting, and so assume I'm always working within a single page.

For a simple one-column page, I guess the process starts with a text-alignment/best-match problem. I have dim memories of there being standard algorithms for this, and with all the gene sequencing work going on now there are presumably lots more. Python would be a likely bet for cookbook-style examples.

Once I have the best-fit alignment of the two texts, I can both replace incorrect letters in the hOCR and delete unwanted letters, along with their locations, from the hOCR. But the third possibility seems harder: in my experience it is quite common for OCR output to miss out whole words. How would I generate the location information for a word which is missing from the hOCR? Similarly, if there is a poorly scanned bit at the edge of a page where the OCR output is just gibberish, how do I know the locations of the characters I'm replacing it with? Try to interpolate from the positions of the surrounding text, I guess, so you would get locations which are slightly off. (This would not matter at all for searching within the PDF, and maybe not much for copy-and-paste?)

Then what happens with a multi-column layout, or text that flows round image boxes? Can I still use my hypothetical text-alignment algorithm? I have no experience with hOCR and don't know how the Tesseract hOCR output linearizes these things. Are there fixed rules for how the hOCR data is ordered in the file?

Are there any helpful texts about hOCR? I found a formal grammar which was no help (to me) at all, but nothing else so far.
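
To make the align-then-interpolate idea concrete, here is a minimal sketch for one single-column page. It is only an illustration under assumptions, not a tested tool: it assumes the hOCR words have already been extracted into (text, bounding box) pairs, it aligns whole words rather than letters using `difflib.SequenceMatcher` from the Python standard library, and the helper names (`split_box`, `union`, `gap_box`, `correct_page`) are made up for this example. Missing words get a box squeezed into the gap between their neighbours, so their positions are approximate.

```python
from difflib import SequenceMatcher

# Assumed data shape: each OCR word is (text, (x0, y0, x1, y1)),
# e.g. as it might be read from the word-level boxes of one hOCR page.

def split_box(box, words):
    """Divide one box horizontally among words, in proportion to word length."""
    x0, y0, x1, y1 = box
    total = sum(len(w) for w in words)
    out, left, used = [], x0, 0
    for w in words:
        used += len(w)
        right = x0 + (x1 - x0) * used / total
        out.append((w, (round(left), y0, round(right), y1)))
        left = right
    return out

def union(boxes):
    """Smallest box covering all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def gap_box(prev_box, next_box):
    """Guess a box for words the OCR missed, from the neighbouring boxes."""
    if prev_box and next_box:
        same_line = prev_box[1] < next_box[3] and next_box[1] < prev_box[3]
        if same_line and next_box[0] >= prev_box[2]:
            # Use the horizontal gap between the two neighbours.
            return (prev_box[2], min(prev_box[1], next_box[1]),
                    next_box[0], max(prev_box[3], next_box[3]))
    # Line break or page edge: fall back to a zero-width box at the
    # nearest neighbour's edge (crude, but keeps reading order).
    anchor = prev_box or next_box or (0, 0, 0, 0)
    edge = anchor[2] if prev_box else anchor[0]
    return (edge, anchor[1], edge, anchor[3])

def correct_page(ocr_words, truth_text):
    """Return ground-truth words paired with boxes derived from the OCR."""
    truth = truth_text.split()
    ocr_text = [w for w, _ in ocr_words]
    matcher = SequenceMatcher(None, ocr_text, truth, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(ocr_words[i1:i2])
        elif tag == "replace":
            # Wrong or garbled words: reuse the area they occupied.
            area = union([b for _, b in ocr_words[i1:i2]])
            result.extend(split_box(area, truth[j1:j2]))
        elif tag == "insert":
            # Words missing from the OCR: interpolate a position.
            prev_box = ocr_words[i1 - 1][1] if i1 > 0 else None
            next_box = ocr_words[i1][1] if i1 < len(ocr_words) else None
            result.extend(split_box(gap_box(prev_box, next_box), truth[j1:j2]))
        # "delete": OCR words absent from the ground truth are dropped.
    return result

page = [("the", (10, 10, 40, 30)), ("quick", (50, 10, 110, 30)),
        ("fox", (180, 10, 210, 30)), ("jumped", (220, 10, 290, 30)),
        ("over", (300, 10, 340, 30)), ("the", (350, 10, 380, 30)),
        ("lazy", (390, 10, 430, 30)), ("dag", (440, 10, 470, 30))]
fixed = correct_page(page, "the quick brown fox jumped over the lazy dog")
for word, box in fixed:
    print(word, box)
```

In this toy page the OCR dropped "brown" and misread "dog" as "dag"; the sketch gives "brown" the gap between "quick" and "fox" and lets "dog" inherit the box of "dag". Known gaps: a replaced run that crosses a line break gets one box spanning both lines, and nothing here addresses multi-column reading order.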

Graham

On 19/02/2021 15:44, Tom Morris wrote:
> On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net
> wrote:
>
> > There are lots of pdfs of scanned books around which include
> > moderately good ocr-ed text (eg on archive.org <http://archive.org>).
>
> OCR quality varies widely (even wildly) across scans and vintages of
> OCR, so it's worth checking your "moderately good" assumption for any
> edition/scan that you want to work with. Poor quality OCR will make the
> task impossible
>
> > There are also lots of epub, text or html books which have been
> > created from this ocr output text, manually corrected (eg.
> > gutenberg.org <http://gutenberg.org>).
>
> Gutenberg (and pgdp) are just "manually corrected" (or at least they
> didn't used to be) due to Gutenberg's "editionless" policy and specific
> editorial decisions made by individual pgdp project coordinators. In the
> same way the OCR noise increases the difficulty of the task, the further
> the pgdp draft drifts from a 1-to-1 transcription, the harder the
> alignment task becomes.
>
> > There is no feedback loop between the two - the manually corrected
> > text is never used to improve the text embedded in the pdf. This
> > also applies if I scan books myself and manually correct the
> > extracted ocr text - there is no way I know of to generate a pdf
> > with fully correct embedded text using my manual corrections.
> >
> > One way to fix this might be if tesseract could take a manually
> > corrected text as a kind of 'hint' file along with the original
> > scanned pages, and then do a second pass to generate the final pdf
> > version, with fully correct embedded text. Obviously there could be
> > problems around keeping the scan processing and the hint text in
> > sync, but generally this sounds to me like it should be do-able.
> > Would it be?
>
> Alignment/synchronization is exactly the crux of the problem. The OCR
> output is text plus bounding box information. In the simple case, with
> good page segmentation, low OCR error rates, predictable pgdp editorial
> decisions (hyphenated words split across line endings closed up, etc),
> it's simply a matter of replacing "the quick brown fox jumped over the
> lazy /dag/" with "the quick brown fox jumped over the lazy *dog*", but
> what if the ground truth says "the quick brown fox jumped over the lazy
> cat" or "the quick fox jumped over the dog"? Is that due to us working
> with a different edition (PG never used to record editions - does it
> now?) or ...? The easy solution would be to only fix isolated errors
> with high confidence replacements, but it's unclear how much that would
> leave unfixed. That would be an interesting analysis. There are a number
> of ancillary issues lurking under the covers like dealing with running
> headers/footers, signature numbers/marks, etc
>
> I think it would be an interesting project, but it wouldn't be trivial.
> I don't think it needs to involve Tesseract since you could do it
> entirely as a post-processing step using the hOCR output and your ground
> truth text.
>
> Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6d94ea12-33ba-fa5f-a551-8fdf05e7b574%40theseamans.net.