[tesseract-ocr] Re: Strange OCR results from table of contents

Tom Morris Fri, 19 Jan 2024 07:44:08 -0800

On Friday, January 19, 2024 at 8:44:13 AM UTC-5 Lars Aronsson wrote:

How come? Is it the unusual line spacing that makes Tesseract 
confused? Or the dotted line? Why does it fill in letters 
where there should be word-separating spaces?



I think the simplest and most likely explanation is that there wasn't any 
text like that in its training set. You might be able to improve the 
situation by creating ground truth text and images for table-of-contents 
lines and fine-tuning the Danish model. That could either produce an 
enhanced model, if it doesn't degrade performance on normal text too much, 
or a separate dan_toc model to use on pages that have been identified as 
tables of contents.
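
As a rough sketch of the ground truth step, assuming a tesstrain-style layout 
in which every line image (.png or .tif) sits next to a single-line 
transcription with the same base name and a .gt.txt extension. The thread does 
not name a training tool, so that layout, the function name 
write_ground_truth and the file names in the usage note are assumptions made 
for illustration only.

```python
from pathlib import Path

# Image extensions treated as line images (an assumption, see above).
IMAGE_SUFFIXES = (".png", ".tif")


def write_ground_truth(gt_dir, transcriptions):
    """Write one .gt.txt file per line image found in gt_dir.

    transcriptions maps an image file name to its hand-typed text.
    Returns the names of images that still have no transcription.
    """
    gt_dir = Path(gt_dir)
    missing = []
    # sorted() materialises the listing before any new files are written.
    for image in sorted(gt_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = transcriptions.get(image.name)
        if text is None:
            missing.append(image.name)
            continue
        # Collapse runs of whitespace so the transcription stays on one line.
        single_line = " ".join(text.split())
        image.with_suffix(".gt.txt").write_text(single_line + "\n", encoding="utf-8")
    return missing
```

Calling write_ground_truth("data/dan_toc-ground-truth", {"toc_001.png": "..."}) 
(a hypothetical directory and file name) writes toc_001.gt.txt and returns the 
images that still need typing. Dotted leaders should be transcribed the way 
you want them to appear in the OCR output.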

As an aside, it also looks like you've got some page segmentation issues, 
since the last line on the page ("Literatur-Fortegnelse til Bronzealderen") 
is being output at the top. This might be something you could clean up by 
post-processing the hOCR output or by doing the page segmentation yourself.
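
A minimal sketch of the hOCR post-processing idea, using only the Python 
standard library. It assumes a single-column page, that each text line is a 
span element with class ocr_line, and that its title attribute carries a 
"bbox x0 y0 x1 y1" entry. The names HocrLineParser and lines_in_reading_order 
are made up for this example. Lines are simply re-sorted by the top edge of 
their bounding box, which would not be enough for multi-column layouts.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, words) for every ocr_line span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []   # list of ((x0, y0, x1, y1), [word, ...])
        self._depth = 0   # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A nested span (e.g. a word) inside the current line.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                bbox = tuple(int(v) for v in match.groups())
                self.lines.append((bbox, []))
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that falls inside a line span.
        if self._depth and data.strip():
            self.lines[-1][1].append(data.strip())


def lines_in_reading_order(hocr):
    """Return the line texts sorted top to bottom, then left to right."""
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    ordered = sorted(parser.lines, key=lambda item: (item[0][1], item[0][0]))
    return [" ".join(words) for _, words in ordered]
```

Feeding it the contents of the .hocr file for the page returns the line texts 
with a misplaced last line moved back to the bottom, provided its bounding box 
was detected correctly.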

Tom

