There is no such thing as 100% OCR accuracy. You simply cannot get good
results from a bad image (maybe Google Vision comes close ;-), but that is a
different story).
Our best practices are collected in the docs (
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).
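
One way to check whether a preprocessing step helps across a batch of images
is to compare the OCR output against hand-typed ground truth and track the
character error rate (CER). The following is only a minimal sketch in plain
Python, not part of Tesseract: the function names and the sample strings are
made up for illustration, and you would fill the pairs with your own ground
truth and Tesseract output.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def cer(truth, ocr):
    """Character error rate: edit distance divided by ground-truth length."""
    if not truth:
        return 0.0 if not ocr else 1.0
    return levenshtein(truth, ocr) / len(truth)


def mean_cer(pairs):
    """Average CER over (ground_truth, ocr_output) pairs."""
    pairs = list(pairs)
    return sum(cer(t, o) for t, o in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical data: one ground-truth line, OCR'd with and without
    # preprocessing. Replace with your own files.
    truth = "Tesseract OCR"
    without_prep = [(truth, "Tesseract 0CR")]
    with_prep = [(truth, "Tesseract OCR")]
    print("CER without preprocessing:", mean_cer(without_prep))
    print("CER with preprocessing:   ", mean_cer(with_prep))
```

Running this over all 100+ images for each candidate preprocessing pipeline
gives you a number to compare, instead of judging a few images by eye.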

Different images/problems need different solutions. E.g. for historical
documents you will need to focus on thresholding, for natural scenes on text
detection, and for invoices on document layout processing...
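
As an illustration of the thresholding step, here is a minimal sketch of
Otsu's global threshold in plain Python. A single global threshold is an
assumption made for the example; degraded historical pages often need
adaptive methods instead. The function names and the tiny sample image are
hypothetical, and in practice you would use an image library rather than
nested lists.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit gray values.

    Picks the threshold t that maximizes the between-class variance of the
    two groups "value <= t" and "value > t".
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with value <= t
    sum_bg = 0    # sum of gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(rows, threshold):
    """Map a 2D list of gray values to 0 (<= threshold) or 255 (> threshold)."""
    return [[255 if p > threshold else 0 for p in row] for row in rows]


if __name__ == "__main__":
    # Hypothetical 2x4 grayscale image: dark text pixels on a light page.
    image = [[30, 35, 210, 220],
             [28, 215, 225, 32]]
    t = otsu_threshold(p for row in image for p in row)
    print("threshold:", t)
    print(binarize(image, t))
```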

There is a lot of research on this; some older papers are available on
academia.edu, e.g.:
https://www.academia.edu/4790395/PhotoOCR_Reading_Text_in_Uncontrolled_Conditions
https://www.academia.edu/2793675/End_to_end_scene_text_recognition
https://www.academia.edu/6030087/Tex_Binarization_In_Color_Documents
https://www.academia.edu/39052965/OCR_to_read_embossed_text_from_Credit_Debit_card
https://www.academia.edu/1171645/A_variational_approach_to_degraded_document_enhancement
https://www.academia.edu/19957320/Multi_spectral_document_image_binarization_using_image_fusion_and_background_subtraction_techniques
https://www.academia.edu/1171639/Low_quality_document_image_modeling_and_enhancement


Zdenko


On Wed, 31 Aug 2022 at 17:10, Adrian Paul Ciobanita
<adrian.cioban...@gmail.com> wrote:

> Can you recommend tutorials or books about how to do image pre-processing
> effectively and efficiently?
>
> Do we need to do different types of image pre-processing for each image?
> If we have 100+ images, how do we ensure that the pre-processing is helping
> the prediction accuracy 100%?
>
> On Wed, Aug 31, 2022, 18:04 Zdenko Podobny <zde...@gmail.com> wrote:
>
>> Trained data from tesseract 5 are compatible with 4, so definitely I
>> would suggest using the latest tesseract version for training - there was a
>> lot of bug fixing and speed improvements.
>>
>> IMO tesseract training has never been easy. I always suggest focusing on
>> image preprocessing rather than training.
>> Following an easy-looking training tutorial could also mean you are on
>> the wrong path (=> you waste your time and your frustration grows).
>> For example: Tesseract 4 has two engines, LSTM and legacy. Which engine is
>> the training in this particular video for? For hints see [1]. Did you plan
>> to train that engine or the other one?
>>
>> [1]
>> https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html#overview-of-training-process
>>
>> Zdenko
>>
>>
>> On Wed, 31 Aug 2022 at 15:40, John Alway <jal...@gmail.com> wrote:
>>
>>> "First of all: if you follow any tutorial on the internet - report the
>>> problem to the author of the tutorial.
>>> Next: use official documentation for training. I see there are a bunch
>>> of folks just "generating content" - to gain an audience. Without insight
>>> and therefore also without support, using old/outdated information..."
>>>
>>>   People are trying to find a nice, easy tutorial to help them get
>>> through the forest.   I think that's the bottom line.   Thanks for the link.
>>>
>>> "Tesseract 4 was released 29 Oct 2018. Almost 4 years ago! The current
>>> tesseract version is 5.2 and the training process was also improved:
>>> https://github.com/tesseract-ocr/tesstrain"
>>>
>>>   I understand this, but I'm using C# .Net, and I don't think version 5
>>> is available in C#.  Unless I'm mistaken?  There are costly packages, such
>>> as IronOcr which uses tesseract 5, but there is no way I can take that
>>> route.
>>>
>>>   Regards,
>>>  ...John
>>>
>>>
>>>
>>>
>>> On Wed, Aug 31, 2022 at 3:27 AM Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> First of all: if you follow any tutorial on the internet - report the
>>>> problem to the author of the tutorial.
>>>> Next: use official documentation for training. I see there are a bunch
>>>> of folks just "generating content" - to gain an audience. Without insight
>>>> and therefore also without support, using old/outdated information...
>>>> Tesseract 4 was released 29 Oct 2018. Almost 4 years ago! The current
>>>> tesseract version is 5.2 and the training process was also improved:
>>>> https://github.com/tesseract-ocr/tesstrain
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Wed, 31 Aug 2022 at 0:18, John Alway <jal...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I've been following a tutorial on youtube titled "Tesseract OCR -
>>>>> Lesson 2: Training Tesseract for new font" here:
>>>>> https://www.youtube.com/watch?v=1v8BPw0Dn0I&ab_channel=TheCode
>>>>>
>>>>> I'm using tesseract 4.0 on Windows 10.
>>>>>
>>>>> I went through the steps he used, and everything seems to go smoothly
>>>>> until I get to the actual training. When I run "mftraining" the program
>>>>> hangs. It seems to get stuck and doesn't indicate why or what it's doing.
>>>>>
>>>>> I'm using a set of fonts in an image. I have the full alphabet, upper
>>>>> and lower case, and the numbers 0 to 9 on the png image. I've attached
>>>>> the image. Unlike him, I'm using English. I don't know the font, so I'm
>>>>> just calling it tiktok to give it a name. My training file is called
>>>>> eng.tiktok.exp0.tr
>>>>>
>>>>> I used jTessBoxEditor to correct mistakes and set the box sizes and
>>>>> positions precisely.
>>>>>
>>>>>
>>>>> When I run this command:
>>>>>  mftraining -F font_properties -U unicharset -O eng.unicharset eng.tiktok.exp0.tr
>>>>>
>>>>> The program just hangs. I've waited over twenty minutes.
>>>>>
>>>>> Should I wait longer?   What could cause it to hang?
>>>>>
>>>>>
>>>>>
>>>>> Thanks!
>>>>> ...John
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/534c3f74-420b-4c96-83dd-609bcb002f81n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/534c3f74-420b-4c96-83dd-609bcb002f81n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>>
>>>
>>
>
