I'm running standard Ubuntu Linux with Tesseract 5.3.0, and it gives very good results in almost every situation, with one strange exception: tables of contents.
Here is a typical page from a book in Danish, printed in 1897:
https://runeberg.org/voroldtid/0344.html

Below the image is the raw OCR text from tesseract -l dan, using the full-resolution JPEG image (2464 x 3610 pixels) as input. The OCR text has some initial garbage, but the rest follows in near-perfect quality.

Here is the table of contents from the same book:
https://runeberg.org/voroldtid/0011.html

Below the image is the OCR text after manual proofreading. The original raw OCR output from Tesseract can be seen here:
https://runeberg.org/rc.pl?action=show&version=1&src=voroldtid/0011

A typical line there reads:

URMMDEnFs]dretStenaldersåBopladserikereen ER 5 ANSE URE

instead of the desired:

I. Den ældre Stenalders Bopladser ......... 7

How come? Is it the unusual line spacing that confuses Tesseract? Or the dotted line? Why does it fill in letters where there should be word-separating spaces?

--
Lars Aronsson (l...@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/