[tesseract-ocr] Strange OCR results from table of contents

Lars Aronsson Fri, 19 Jan 2024 05:44:11 -0800

I'm running a standard Ubuntu Linux with Tesseract 5.3.0 and
it gives very good results in almost every situation, with
one strange exception: tables of contents.


Here is a typical page from a book in Danish, printed in 1897:
https://runeberg.org/voroldtid/0344.html

Below the image is the raw OCR text from tesseract -l dan,
using the full-resolution JPEG image (2464 x 3610 pixels) as input.

The OCR text has some initial garbage, but then the text
follows in near perfect quality.

Here is the table of contents from the same book:
https://runeberg.org/voroldtid/0011.html

Below the image is the OCR text after manual proofreading;
the original raw OCR output from Tesseract can be seen here:

https://runeberg.org/rc.pl?action=show&version=1&src=voroldtid/0011

A typical line there reads:

URMMDEnFs]dretStenaldersåBopladserikereen ER 5 ANSE URE

Instead of the desired:

I. Den ældre Stenalders Bopladser ......... 7.

How come? Is it the unusual line spacing that makes Tesseract
confused? Or the dotted line? Why does it fill in letters
where there should be word-separating spaces?
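
One experiment that might narrow this down is to run the same page
under several page segmentation modes and compare the output. Below is
a minimal Python sketch of that idea, assuming the tesseract binary is
on the PATH and the Danish language data is installed. The function
names and the choice of modes are my own, and I have not verified that
any of these modes actually improves this particular page.

```python
import subprocess

# Page segmentation modes to compare (listed by `tesseract --help-psm`):
#   3  = fully automatic page segmentation (the default)
#   4  = assume a single column of text of variable sizes
#   6  = assume a single uniform block of text
#   11 = sparse text
PSM_MODES = (3, 4, 6, 11)


def build_command(image_path, lang="dan", psm=3):
    """Return the argv that makes tesseract print recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, lang="dan", psm=3):
    """Run tesseract once and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


def compare_modes(image_path, lang="dan", modes=PSM_MODES):
    """Return a dict mapping each segmentation mode to its OCR output."""
    return {psm: run_ocr(image_path, lang, psm) for psm in modes}
```

Calling compare_modes("0011.jpg") on a local copy of the scan (the
file name is hypothetical) would give one text per mode, which could
then be compared line by line against the proofread version.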


--
  Lars Aronsson (l...@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/


