On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net 
wrote:

>
> There are lots of pdfs of scanned books around which include moderately 
> good ocr-ed text (eg on archive.org). 
>

OCR quality varies widely (even wildly) across scans and vintages of OCR, 
so it's worth checking your "moderately good" assumption for any 
edition/scan that you want to work with. Poor-quality OCR will make the 
task impossible.
 

> There are also lots of epub, text or html books which have been created 
> from this ocr output text, manually corrected (eg. gutenberg.org). 
>

Gutenberg (and pgdp) texts aren't just "manually corrected" (or at least 
they didn't used to be): they also drift from the source due to Gutenberg's 
"editionless" policy and specific editorial decisions made by individual 
pgdp project coordinators. In the same way that OCR noise increases the 
difficulty of the task, the further the pgdp draft drifts from a 1-to-1 
transcription, the harder the alignment task becomes.
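One concrete normalization you can do before attempting alignment is to undo known editorial transforms on the OCR side. A minimal sketch (it assumes a trailing hyphen at a line end always marks a broken word, which genuine hyphenated compounds like "well-known" would violate; real code would need a wordlist check):

```python
import re

def close_up_hyphens(ocr_text):
    """Join words hyphenated across line endings ("la-\\nzy" -> "lazy"),
    mirroring the pgdp convention of closing them up in the corrected text.

    Caveat: a line-end hyphen on a genuine compound is joined too, so this
    is only a first-pass normalization, not a safe general rule.
    """
    return re.sub(r'(\w)-\n(\w)', r'\1\2', ocr_text)
```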
 

> There is no feedback loop between the two - the manually corrected text is 
> never used to improve the text embedded in the pdf. This also applies if I 
> scan books myself and manually correct the extracted ocr text - there is no 
> way I know of to generate a pdf with fully correct embedded text using my 
> manual corrections.
>
> One way to fix this might be if tesseract could take a manually corrected 
> text as a kind of 'hint' file along with the original scanned pages, and 
> then do a second pass to generate the final pdf version, with fully correct 
> embedded text.  Obviously there could be problems around keeping the scan 
> processing and the hint text in sync, but generally this sounds to me like 
> it should be do-able. Would it be?
>

Alignment/synchronization is exactly the crux of the problem. The OCR 
output is text plus bounding-box information. In the simple case, with good 
page segmentation, low OCR error rates, and predictable pgdp editorial 
decisions (hyphenated words split across line endings closed up, etc.), it's 
simply a matter of replacing "the quick brown fox jumped over the lazy *dag*" 
with "the quick brown fox jumped over the lazy *dog*". But what if the 
ground truth says "the quick brown fox jumped over the lazy cat", or "the 
quick fox jumped over the dog"? Is that because we're working with a different 
edition (PG never used to record editions - does it now?), or something else? 
The safe solution would be to fix only isolated errors with high-confidence 
replacements, but it's unclear how much that would leave unfixed; that 
would be an interesting analysis. There are also a number of ancillary issues 
lurking under the covers, like dealing with running headers/footers, 
signature numbers/marks, etc.
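The conservative "isolated errors only" approach can be sketched with nothing more than difflib. The function name and the 0.5 similarity threshold here are my own choices for illustration, not anything Tesseract or pgdp provides:

```python
import difflib

def patch_ocr(ocr_words, truth_words, min_ratio=0.5):
    """Replace only isolated, high-confidence OCR errors with ground truth.

    Conservative: applies a fix only where the aligner found a one-to-one
    word substitution whose strings are still similar (e.g. dag -> dog).
    Larger mismatches (a different edition, a rewritten sentence) are
    deliberately left alone.
    """
    patched = list(ocr_words)
    sm = difflib.SequenceMatcher(a=ocr_words, b=truth_words, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # Only consider same-length "replace" blocks: word-for-word swaps.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                sim = difflib.SequenceMatcher(
                    a=ocr_words[i], b=truth_words[j]).ratio()
                if sim >= min_ratio:
                    patched[i] = truth_words[j]
    return patched
```

With this, "dag" vs. "dog" gets fixed (the strings share most of their characters), while "dag" vs. "cat" is left untouched, which is exactly the cautious behavior you'd want when the edition is in doubt.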

I think it would be an interesting project, but it wouldn't be trivial. I 
don't think it needs to involve Tesseract since you could do it entirely as 
a post-processing step using the hOCR output and your ground truth text.
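For the post-processing route: Tesseract's hOCR output wraps each word in a span of class ocrx_word with the bounding box in the title attribute, so building a corrected text layer mostly means pairing your ground-truth words with those boxes. A rough sketch of pulling out the (word, bbox) pairs; real hOCR carries more attributes (id, x_wconf, etc.) and deserves a proper HTML parser rather than this simplified regex:

```python
import re

# Hypothetical minimal hOCR fragment; real output comes from
# `tesseract image out hocr` and contains full page/line structure.
SAMPLE_HOCR = (
    '<span class="ocrx_word" title="bbox 10 5 62 20">the</span>\n'
    '<span class="ocrx_word" title="bbox 70 5 130 20">lazy</span>\n'
    '<span class="ocrx_word" title="bbox 138 5 190 20">dag</span>'
)

WORD = re.compile(
    r'<span class="ocrx_word" title="bbox (\d+) (\d+) (\d+) (\d+)">'
    r'([^<]*)</span>')

def words_with_boxes(hocr_text):
    """Extract (word, (x0, y0, x1, y1)) pairs from hOCR word spans."""
    return [(m.group(5), tuple(map(int, m.group(1, 2, 3, 4))))
            for m in WORD.finditer(hocr_text)]
```

Once you have those pairs, the aligned ground-truth words can be written back at the same coordinates by whatever PDF library you use for the text layer.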

Tom
