On Friday, January 19, 2024 at 8:44:13 AM UTC-5 Lars Aronsson wrote:

> How come? Is it the unusual line spacing that makes Tesseract confused? Or the dotted line? Why does it fill in letters where there should be word-separating spaces?

I think the simplest and most likely explanation is that there wasn't any text like that in its training set. You might be able to improve the situation by creating ground truth text and images for table-of-contents lines and fine-tuning the Danish model. That could give you either an enhanced model, if it doesn't degrade performance on normal text too much, or a separate dan_toc model to use on pages that are identified as table-of-contents pages.

As an aside, it also looks like you've got some page segmentation issues, since the last line on the page ("Literatur-Fortegnelse til Bronzealderen") is being output at the top. This might be something you could clean up by post-processing the hOCR output or by doing the page segmentation yourself.

Tom
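
For the hOCR post-processing idea, here is a rough sketch of what I have in mind, not a tested fix for your pages. It assumes the hOCR marks each text line as a `span` with class `ocr_line` and a `bbox x0 y0 x1 y1` entry in its `title` attribute, and that sorting lines by their top edge is good enough for a single-column page. The names `LineCollector` and `lines_top_to_bottom` are made up for this example, and it uses only the Python standard library.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 100 1400 900 1440; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineCollector(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every <span class="ocr_line">."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines
        self._bbox = None  # bbox of the line currently being read, if any
        self._depth = 0    # <span> nesting depth inside the current line
        self._words = []   # text fragments seen inside the current line

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # Already inside a line: only track nesting of word spans.
            if tag == "span":
                self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if tag == "span" and "ocr_line" in classes and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._depth = 1
            self._words = []

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        self._depth -= 1
        if self._depth == 0:
            # The line's own closing tag: store it and reset.
            text = " ".join(" ".join(self._words).split())
            self.lines.append((*self._bbox, text))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self._words.append(data.strip())


def lines_top_to_bottom(hocr):
    """Return the line texts ordered by top edge (y0), then left edge (x0)."""
    parser = LineCollector()
    parser.feed(hocr)
    parser.close()
    ordered = sorted(parser.lines, key=lambda line: (line[1], line[0]))
    return [line[4] for line in ordered]
```

You would read the `.hocr` file into a string and pass it to `lines_top_to_bottom`. A plain sort on `y0` will interleave columns on multi-column pages, so in that case you would want to group by block first.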