[tesseract-ocr] Re: Creating Starter Traineddata

2024-01-19 Thread Simon
Here is a link to the website of Uni Mannheim: COMBINE_LANG_MODEL - generate starter traineddata. Unfortunately, the command doesn't create any files, and after running it I don't get any feedback on why the co

[tesseract-ocr] Strange OCR results from table of contents

2024-01-19 Thread Lars Aronsson
I'm running standard Ubuntu Linux with Tesseract 5.3.0, and it gives very good results in almost every situation, with one strange exception: tables of contents. Here is a typical page from a book in Danish, printed in 1897: https://runeberg.org/voroldtid/0344.html Below the image is t

[tesseract-ocr] Re: Creating Starter Traineddata

2024-01-19 Thread Simon
OK, somehow I had "no entry point found" errors in the DLL files. Reinstalling Tesseract solved the problem. Now I have run into another interesting problem: combine_lang_model --input_unicharset C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset --script_dir C:/Us

Re: [tesseract-ocr] Re: Creating Starter Traineddata

2024-01-19 Thread Dellu Bw
Yes, you need to add them before you create the starter model. You can edit the Latin.unicharset before you run the combine command. On Fri, Jan 19, 2024, 5:27 PM Simon wrote: > Ok somehow I had "no entry point found" errors in the dll files. > Reinstallation of Tesseract solved the Problem. > >

[tesseract-ocr] Re: Creating Starter Traineddata

2024-01-19 Thread Tom Morris
On Thursday, January 18, 2024 at 5:11:52 AM UTC-5 smon...@gmail.com wrote: In general the instructions on https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters say that you have to make a starter traineddata from the unicharset, but is thi

[tesseract-ocr] Re: Strange OCR results from table of contents

2024-01-19 Thread Tom Morris
On Friday, January 19, 2024 at 8:44:13 AM UTC-5 Lars Aronsson wrote: How come? Is it the unusual line spacing that makes Tesseract confused? Or the dotted line? Why does it fill in letters where there should be word-separating spaces? I think the simplest and most likely explanation is that t