On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net wrote:
> There are lots of pdfs of scanned books around which include moderately
> good ocr-ed text (eg on archive.org).

OCR quality varies widely (even wildly) across scans and vintages of OCR, so it's worth checking your "moderately good" assumption for any edition/scan you want to work with. Poor-quality OCR will make the task impossible.

> There are also lots of epub, text or html books which have been created
> from this ocr output text, manually corrected (eg. gutenberg.org).

Gutenberg (and pgdp) texts aren't just "manually corrected" (or at least they didn't used to be), due to Gutenberg's "editionless" policy and specific editorial decisions made by individual pgdp project coordinators. Just as OCR noise increases the difficulty of the task, the further the pgdp draft drifts from a 1-to-1 transcription, the harder the alignment task becomes.

> There is no feedback loop between the two - the manually corrected text is
> never used to improve the text embedded in the pdf. This also applies if I
> scan books myself and manually correct the extracted ocr text - there is no
> way I know of to generate a pdf with fully correct embedded text using my
> manual corrections.
>
> One way to fix this might be if tesseract could take a manually corrected
> text as a kind of 'hint' file along with the original scanned pages, and
> then do a second pass to generate the final pdf version, with fully correct
> embedded text. Obviously there could be problems around keeping the scan
> processing and the hint text in sync, but generally this sounds to me like
> it should be do-able. Would it be?

Alignment/synchronization is exactly the crux of the problem. The OCR output is text plus bounding box information.
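To make that concrete, here's a minimal sketch of pulling words and their bounding boxes out of Tesseract's hOCR output (the `ocrx_word` spans with `bbox` coordinates in the `title` attribute). The sample string and function name are my own for illustration; a real hOCR file from `tesseract page.png page hocr` has the same span/title structure, and you'd probably want a proper HTML parser rather than a regex for production use.

```python
# Sketch: extract (word, bounding box) pairs from hOCR output.
# Assumes Tesseract's usual attribute order (class before title).
import re

HOCR_WORD = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)

def words_with_boxes(hocr: str):
    """Yield (word, (x0, y0, x1, y1)) for each ocrx_word span."""
    for m in HOCR_WORD.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        text = re.sub(r"<[^>]+>", "", m.group(5)).strip()  # drop inner markup
        yield text, (x0, y0, x1, y1)

# Made-up two-word fragment in hOCR form, for demonstration only.
sample = (
    "<span class='ocrx_word' title='bbox 10 20 60 40; x_wconf 91'>lazy</span> "
    "<span class='ocrx_word' title='bbox 65 20 110 40; x_wconf 55'>dag</span>"
)
print(list(words_with_boxes(sample)))
# → [('lazy', (10, 20, 60, 40)), ('dag', (65, 20, 110, 40))]
```

Once you have the word/box pairs, the correction problem is purely about deciding which words to swap, while the boxes stay put.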
In the simple case (good page segmentation, low OCR error rates, predictable pgdp editorial decisions such as hyphenated words split across line endings being closed up), it's simply a matter of replacing "the quick brown fox jumped over the lazy *dag*" with "the quick brown fox jumped over the lazy *dog*". But what if the ground truth says "the quick brown fox jumped over the lazy cat" or "the quick fox jumped over the dog"? Is that because we're working with a different edition (PG never used to record editions - does it now?), or something else?

The easy solution would be to fix only isolated errors with high-confidence replacements, but it's unclear how much that would leave unfixed. That would be an interesting analysis. There are also a number of ancillary issues lurking under the covers, like dealing with running headers/footers, signature numbers/marks, etc.

I think it would be an interesting project, but it wouldn't be trivial. I don't think it needs to involve Tesseract, since you could do it entirely as a post-processing step using the hOCR output and your ground-truth text.

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5fc0ee4a-7a9b-40f9-91e0-57ec7cb54bd3n%40googlegroups.com.
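P.S. The "only fix isolated errors with high-confidence replacements" idea above can be sketched with a plain sequence alignment over the word streams. This is just an illustration of the approach, not an existing tool: align OCR tokens against ground-truth tokens, and accept only one-word-for-one-word substitutions whose spellings are similar (so "dag" → "dog" passes, but "dog" → "cat", a possible edition difference, is left for a human). The similarity threshold here is an arbitrary assumption.

```python
# Sketch: accept only isolated, spelling-similar word substitutions
# when aligning OCR output against a corrected ground-truth text.
import difflib

def safe_fixes(ocr_words, truth_words):
    """Return (index, old, new) for high-confidence single-token fixes."""
    sm = difflib.SequenceMatcher(a=ocr_words, b=truth_words, autojunk=False)
    fixes = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace" and (i2 - i1) == 1 and (j2 - j1) == 1:
            old, new = ocr_words[i1], truth_words[j1]
            # Require similar spelling: "dag"/"dog" passes, "dog"/"cat"
            # does not. The 0.5 cutoff is an illustrative choice.
            if difflib.SequenceMatcher(a=old, b=new).ratio() >= 0.5:
                fixes.append((i1, old, new))
    return fixes

ocr = "the quick brown fox jumped over the lazy dag".split()
truth = "the quick brown fox jumped over the lazy dog".split()
print(safe_fixes(ocr, truth))  # → [(8, 'dag', 'dog')]
```

Anything the alignment flags as an insertion, deletion, or multi-word change would stay untouched for manual review, which is exactly where the "how much would that leave unfixed" analysis comes in.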